* [RFC PATCH 1/5] futex: add new exclusive lock & unlock command codes
2014-07-21 15:24 [RFC PATCH 0/5] futex: introduce an optimistic spinning futex Waiman Long
@ 2014-07-21 15:24 ` Waiman Long
2014-07-21 16:42 ` Thomas Gleixner
[not found] ` <1405956271-34339-1-git-send-email-Waiman.Long-VXdhtT5mjnY@public.gmane.org>
` (5 subsequent siblings)
6 siblings, 1 reply; 51+ messages in thread
From: Waiman Long @ 2014-07-21 15:24 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Heiko Carstens
Cc: linux-kernel, linux-api, linux-doc, Jason Low, Scott J Norton,
Waiman Long
This patch adds two new simple exclusive lock and unlock primitives
or command codes (FUTEX_SPIN_LOCK and FUTEX_SPIN_UNLOCK) to provide
to the users of the futex(2) syscall. These two primitives provide
an abstract for a mutual exclusive lock (mutex).
The semantics is about the same as the PI and robust futexes where
a locker atomically puts its thread ID into the futex word to get
lock and clear it to free the lock.
The major differences between the new primitives and the existing
ones are on how the waiters are enqueued into the waiting queue. The
existing primitives use a priority list to store the waiting tasks
from all the futexes that hashed to the same hash bucket.
The new primitives use two different lists. The first list off the
hash bucket is to store a new dynamically allocated data structure
(futex_q_head) that contains information about all the waiters for
a given futex. So if two futexes hash to the same bucket, the list
will have 2 entries irrespective of the number of tasks contending
the futexes.
The second list is in the futex_q_head contains all the waiters
waiting for the futex to be freed. There are also 2 spinlocks used -
one in the hash bucket and one in the futex_q_head structure.
The main advantages of the 2-list approach is to reduce the amount of
spinlock contention that can happen to a heavily contended futex. It
is not unusually that over 90% of CPU cycles can be spent in the
lock spinning for a heavily contended futex. By making the lock more
granular, the spinlock contention can be reduced.
The disadvantage, however, is the need to dynamically allocate the new
structure. To reduce that overhead, a single-slot cache entry is added
to the futex hash bucket structure to hold a single free futex_q_head
structure. So as long as there is no futex address collision, the futex
code doesn't need to do a lot of memory allocation and deallocation.
With the new primitives, a simple 256-thread mutex micro-benchmark
on a 4-socket 40-core system gave the following perf profile:
6.19% futex_test [kernel.kallsyms] [k] _raw_spin_lock_irqsave
5.17% futex_test [kernel.kallsyms] [k] futex_spin_lock
3.87% futex_test [kernel.kallsyms] [k] _raw_spin_lock
2.72% futex_test [kernel.kallsyms] [k] drop_futex_key_refs
2.46% futex_test [kernel.kallsyms] [k] futex_spin_unlock
2.36% futex_test [kernel.kallsyms] [k] gup_pte_range
1.87% futex_test [kernel.kallsyms] [k] get_user_pages_fast
1.84% futex_test [kernel.kallsyms] [k] __audit_syscall_exit
The corresponding one with the wait-wake futex was as follows:
93.52% futex_test [kernel.kallsyms] [k] _raw_spin_lock
1.14% futex_test [kernel.kallsyms] [k] futex_wait_setup
0.97% futex_test [kernel.kallsyms] [k] futex_wake
In term of the reported system time used on a 10 million mutex
operations, the new primitives used only 12.7s while the old ones
needed 104.4s.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
include/uapi/linux/futex.h | 4 +
kernel/futex.c | 453 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 457 insertions(+), 0 deletions(-)
diff --git a/include/uapi/linux/futex.h b/include/uapi/linux/futex.h
index 0b1f716..58aa4fb 100644
--- a/include/uapi/linux/futex.h
+++ b/include/uapi/linux/futex.h
@@ -20,6 +20,8 @@
#define FUTEX_WAKE_BITSET 10
#define FUTEX_WAIT_REQUEUE_PI 11
#define FUTEX_CMP_REQUEUE_PI 12
+#define FUTEX_SPIN_LOCK 13
+#define FUTEX_SPIN_UNLOCK 14
#define FUTEX_PRIVATE_FLAG 128
#define FUTEX_CLOCK_REALTIME 256
@@ -39,6 +41,8 @@
FUTEX_PRIVATE_FLAG)
#define FUTEX_CMP_REQUEUE_PI_PRIVATE (FUTEX_CMP_REQUEUE_PI | \
FUTEX_PRIVATE_FLAG)
+#define FUTEX_SPIN_LOCK_PRIVATE (FUTEX_SPIN_LOCK | FUTEX_PRIVATE_FLAG)
+#define FUTEX_SPIN_UNLOCK_PRIVATE (FUTEX_SPIN_UNLOCK | FUTEX_PRIVATE_FLAG)
/*
* Support for robust futexes: the kernel cleans up held futexes at
diff --git a/kernel/futex.c b/kernel/futex.c
index b632b5f..ec9b6ee 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -23,6 +23,9 @@
* Copyright (C) IBM Corporation, 2009
* Thanks to Thomas Gleixner for conceptual design and careful reviews.
*
+ * Optimistic-spinning futexes by Waiman Long
+ * Copyright (C) 2014 Hewlett-Packard Development Company, L.P.
+ *
* Thanks to Ben LaHaise for yelling "hashed waitqueues" loudly
* enough at me, Linus for the original (flawed) idea, Matthew
* Kirkwood for proof-of-concept implementation.
@@ -253,6 +256,8 @@ struct futex_hash_bucket {
atomic_t waiters;
spinlock_t lock;
struct plist_head chain;
+ struct list_head oslist; /* Optimistic spinning list */
+ void *qhcache; /* Cache for a free queue head */
} ____cacheline_aligned_in_smp;
static unsigned long __read_mostly futex_hashsize;
@@ -2939,6 +2944,446 @@ void exit_robust_list(struct task_struct *curr)
curr, pip);
}
+/*=========================================
+ * Optimistic Spinning Futexes
+ *=========================================
+ *
+ * A major difference between the optimistic spinning (optspin) futex and a
+ * wait-wake futex is the fact we need a central data structure to hold
+ * information on all the spinning and waiting tasks for a given optspin
+ * futex. We can't use the futex_q structure for that purpose. Instead,
+ * we have to dynamically allocate a new futex_q_head structure to hold that
+ * information.
+ *
+ * To ease implementation, these new futex_q_head structures are queued in a
+ * different list within the futex hash bucket. To reduce the memory allocation
+ * and deallocation overhead, another field is added to the hash bucket to
+ * cache the last freed futex_q_head structure. This improves performance
+ * at the expense of a bit higher memory overhead. As long as there is no futex
+ * address hash collision, there is no need to do a lot of memory allocation
+ * and deallocation for using optspin futex.
+ *
+ * The states of each spinning or waiting task are stored in a futex_q_node
+ * structure allocated from the task's kernel stack, just like futex_q.
+ *
+ * Data structure diagram:
+ *
+ * hash futex futex
+ * table qhead 0 qhead 1
+ * | |
+ * +----+ +------+ +------+
+ * | p |---->| next |---->| next |----> ....
+ * +----+ +------+ +------+
+ * | | | |
+ * |
+ * +------+ +------+
+ * |node 0|---->|node 1|----> ....
+ * +------+ +------+
+ *
+ * Locking
+ * -------
+ * Two spinlocks are used by the optspin futexes - the hash bucket spinlock
+ * and the queue head spinlock.
+ *
+ * The hash bucket spinlock protects against the searching and modification
+ * of the futex queue head list as well as the increment of the atomic locker
+ * count. The decrement of locker count is done outside the lock. When the
+ * count reaches 0, the queue head structure can then be deallocated.
+ *
+ * The queue head spinlock protects the futex wait queue from modification.
+ */
+#define FUTEX_TID(u) (pid_t)((u) & FUTEX_TID_MASK)
+#define FUTEX_HAS_WAITERS(u) ((u) & FUTEX_WAITERS)
+
+/**
+ * struct futex_q_head - head of the optspin futex queue, one per unique key
+ * @hnode: list entry from the hash bucket
+ * @waitq: a list of waiting tasks
+ * @key: the key the futex is hashed on
+ * @wlock: spinlock for managing wait queue
+ * @lcnt: locker count
+ *
+ * Locking sequence
+ * ----------------
+ * 1) Lock hash bucket spinlock, locate the futex queue head
+ * 2) Inc lcnt (lock) or read lcnt (unlock), release hash bucket spinlock
+ * 3) For waiter:
+ * - lock futex queue head spinlock
+ * - enqueue into the wait queue
+ * - release the lock & sleep
+ * 4) For unlocker:
+ * - locate the queue head just like a locker does
+ * - Take the queue head lock and wake up the first waiter there.
+ */
+struct futex_q_head {
+ struct list_head hnode;
+ struct list_head waitq;
+ union futex_key key;
+ pid_t otid;
+ spinlock_t wlock;
+ atomic_t lcnt;
+};
+
+/**
+ * struct futex_q_node - a node in the optspin futex queue
+ * @wnode: a list entry for the waiting queue
+ * @task: task_struct pointer of the current task
+ */
+struct futex_q_node {
+ struct list_head wnode;
+ struct task_struct *task;
+};
+
+
+/*
+ * find_qhead - Find a queue head structure with the matching key
+ *
+ * The caller must hold the hash bucket spinlock.
+ */
+static inline struct futex_q_head *
+find_qhead(struct futex_hash_bucket *hb, union futex_key *key)
+{
+ struct futex_q_head *qh;
+
+ list_for_each_entry(qh, &hb->oslist, hnode)
+ if (match_futex(&qh->key, key))
+ return qh;
+ return NULL;
+}
+
+/*
+ * qhead_alloc_init - allocate and initialize a queue head structure.
+ *
+ * The caller must hold the hash bucket spinlock.
+ *
+ * A one-slot queue head structure cache is added to the futex hash bucket
+ * to minimize overhead due to memory allocation and deallocation under light
+ * contention with no hash bucket collision.
+ */
+static inline struct futex_q_head *
+qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key)
+{
+ struct futex_q_head *qh = NULL;
+ static const struct futex_q_head qh0 = { { 0 } };
+
+ if (hb->qhcache)
+ qh = xchg(&hb->qhcache, NULL);
+ if (!qh)
+ qh = kmalloc(sizeof(struct futex_q_head), GFP_KERNEL);
+
+ /*
+ * Initialize the queue head structure
+ */
+ if (qh) {
+ *qh = qh0;
+ qh->key = *key;
+ atomic_set(&qh->lcnt, 1);
+ INIT_LIST_HEAD(&qh->waitq);
+ spin_lock_init(&qh->wlock);
+ list_add(&qh->hnode, &hb->oslist);
+ }
+ return qh;
+}
+
+/*
+ * free_qhead - put queue head structure back to the free qh cache or free it
+ */
+static inline void
+qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
+{
+ bool found;
+ struct futex_q_head *qhead;
+
+ /*
+ * Try to free the queue head structure if no one is using it
+ * Must take the hash bucket lock to make sure that no one
+ * else is touching the locker count.
+ */
+ spin_lock(&hb->lock);
+
+ /*
+ * First make sure that qh is still there and unused in the
+ * hashed list.
+ */
+ found = false;
+ list_for_each_entry(qhead, &hb->oslist, hnode)
+ if (qhead == qh) {
+ found = true;
+ break;
+ }
+
+ if (!found || atomic_read(&qh->lcnt)) {
+ spin_unlock(&hb->lock);
+ return;
+ }
+
+ /*
+ * Free the queue head structure
+ */
+ BUG_ON(!list_empty(&qh->waitq));
+ list_del(&qh->hnode);
+ spin_unlock(&hb->lock);
+
+ if (!hb->qhcache && (cmpxchg(&hb->qhcache, NULL, qh) == NULL))
+ return;
+ kfree(qh);
+}
+
+/*
+ * futex_spin_trylock - attempt to take the lock
+ * Return: 1 if successful or an error happen
+ * 0 otherwise
+ *
+ * Side effect: The uval and ret will be updated.
+ */
+static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
+ int *pret, u32 vpid)
+{
+ u32 old;
+
+ *pret = get_futex_value_locked(puval, uaddr);
+ if (*pret)
+ return 1;
+
+ if (FUTEX_TID(*puval))
+ return 0; /* The mutex is not free */
+
+ old = *puval;
+
+ *pret = cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
+ if (*pret)
+ return 1;
+ if (*puval == old) {
+ /* Adjust uval to reflect current value */
+ *puval = vpid | old;
+ return 1;
+ }
+ return 0;
+}
+
+/*
+ * futex_spin_lock
+ */
+static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
+{
+ struct futex_hash_bucket *hb;
+ struct futex_q_head *qh = NULL;
+ struct futex_q_node qnode;
+ union futex_key key;
+ bool gotlock;
+ int ret, cnt;
+ u32 uval, vpid, old;
+
+ qnode.task = current;
+ vpid = task_pid_vnr(qnode.task);
+
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+ if (unlikely(ret))
+ return ret;
+
+ hb = hash_futex(&key);
+ spin_lock(&hb->lock);
+
+ /*
+ * Locate the queue head for the given key
+ */
+ qh = find_qhead(hb, &key);
+
+ /*
+ * Check the futex value under the hash bucket lock as it might
+ * be changed.
+ */
+ if (futex_spin_trylock(uaddr, &uval, &ret, vpid))
+ goto hbunlock_out;
+
+ if (!qh) {
+ /*
+ * First waiter:
+ * Allocate a queue head structure & initialize it
+ */
+ qh = qhead_alloc_init(hb, &key);
+ if (unlikely(!qh)) {
+ ret = -ENOMEM;
+ goto hbunlock_out;
+ }
+ } else {
+ atomic_inc(&qh->lcnt);
+ }
+ spin_unlock(&hb->lock);
+
+ /*
+ * Put the task into the wait queue and sleep
+ */
+ preempt_disable();
+ get_task_struct(qnode.task);
+ spin_lock(&qh->wlock);
+ list_add_tail(&qnode.wnode, &qh->waitq);
+ __set_current_state(TASK_INTERRUPTIBLE);
+ spin_unlock(&qh->wlock);
+ gotlock = false;
+ for (;;) {
+ ret = get_user(uval, uaddr);
+ if (ret)
+ break;
+ while ((FUTEX_TID(uval) == 0) || !FUTEX_HAS_WAITERS(uval)) {
+ /*
+ * Make sure the waiter bit is set before sleeping
+ * and get the lock if it is available.
+ */
+ if (FUTEX_TID(uval) == 0) {
+ old = uval;
+ ret = futex_atomic_cmpxchg_inatomic(&uval,
+ uaddr, old, vpid);
+ if (ret) {
+ goto dequeue;
+ } else if (uval == old) {
+ gotlock = true;
+ goto dequeue;
+ }
+ continue;
+ }
+
+ old = uval;
+ ret = futex_atomic_cmpxchg_inatomic(&uval, uaddr, old,
+ FUTEX_TID(old) | FUTEX_WAITERS);
+ if (ret)
+ goto dequeue;
+ if (uval == old)
+ break;
+ }
+
+ schedule_preempt_disabled();
+
+ /*
+ * Got a signal? Return EINTR
+ */
+ if (unlikely(signal_pending(qnode.task))) {
+ ret = -EINTR;
+ break; /* Remove from queue */
+ }
+ }
+dequeue:
+ __set_current_state(TASK_RUNNING);
+ /*
+ * Remove itself from the wait queue and go back to optimistic
+ * spinning if it hasn't got the lock yet.
+ */
+ put_task_struct(qnode.task);
+ spin_lock(&qh->wlock);
+ list_del(&qnode.wnode);
+
+ /*
+ * Try to clear the waiter bit if the wait queue is empty
+ */
+ if (list_empty(&qh->waitq)) {
+ int retval = get_futex_value_locked(&uval, uaddr);
+
+ if (!retval && FUTEX_HAS_WAITERS(uval)) {
+ old = uval;
+ uval &= ~FUTEX_WAITERS;
+ (void)cmpxchg_futex_value_locked(&uval, uaddr, old,
+ uval);
+ }
+ }
+ spin_unlock(&qh->wlock);
+ preempt_enable();
+
+ cnt = atomic_dec_return(&qh->lcnt);
+ if (cnt == 0)
+ qhead_free(qh, hb);
+ /*
+ * Need to set the waiters bit there are still waiters
+ */
+ else if (!ret)
+ ret = put_user(vpid | FUTEX_WAITERS, uaddr);
+out:
+ put_futex_key(&key);
+ return ret;
+
+hbunlock_out:
+ spin_unlock(&hb->lock);
+ goto out;
+}
+
+/*
+ * futex_spin_unlock
+ */
+static noinline int futex_spin_unlock(u32 __user *uaddr, unsigned int flags)
+{
+ struct futex_hash_bucket *hb;
+ struct futex_q_head *qh;
+ union futex_key key;
+ struct task_struct *wtask; /* Task to be woken */
+ int ret, lcnt;
+ u32 uval, old, vpid = task_pid_vnr(current);
+
+ ret = get_user(uval, uaddr);
+ if (ret)
+ return ret;
+
+ /*
+ * The unlocker may have cleared the TID value and another task may
+ * steal it. However, if its TID is still set, we need to clear
+ * it as well as the FUTEX_WAITERS bit.
+ */
+ while (FUTEX_TID(uval) == vpid) {
+ old = uval;
+ uval &= ~(FUTEX_TID_MASK | FUTEX_WAITERS);
+ ret = futex_atomic_cmpxchg_inatomic(&uval, uaddr, old, uval);
+ if (ret)
+ return ret;
+ if (uval == old)
+ break;
+ }
+ /*
+ * Need to wake up a waiter
+ */
+ ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
+ if (unlikely(ret))
+ return ret;
+
+ hb = hash_futex(&key);
+ spin_lock(&hb->lock);
+
+ /*
+ * Locate the queue head for the given key
+ */
+ qh = find_qhead(hb, &key);
+ if (!qh) {
+ spin_unlock(&hb->lock);
+ ret = -ESRCH; /* Invalid futex address */
+ goto out;
+ }
+
+ /*
+ * The hash bucket lock is being hold while retrieving the task
+ * structure from the queue head to make sure that the qh structure
+ * won't go away under the hood.
+ * Preemption is disabled during the wakeup process to avoid having
+ * a long gap between getting the task structure and waking it up.
+ */
+ wtask = NULL;
+ preempt_disable();
+ lcnt = atomic_read(&qh->lcnt);
+ if (lcnt) {
+ spin_lock(&qh->wlock);
+ if (!list_empty(&qh->waitq))
+ wtask = list_entry(qh->waitq.next, struct futex_q_node,
+ wnode)->task;
+ spin_unlock(&qh->wlock);
+ }
+ spin_unlock(&hb->lock);
+ if (wtask && wake_up_process(wtask))
+ ret = 1;
+ else if (!wtask)
+ ret = -ESRCH;
+ preempt_enable();
+out:
+ put_futex_key(&key);
+ return ret;
+}
+
+
long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
u32 __user *uaddr2, u32 val2, u32 val3)
{
@@ -2960,6 +3405,8 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
case FUTEX_TRYLOCK_PI:
case FUTEX_WAIT_REQUEUE_PI:
case FUTEX_CMP_REQUEUE_PI:
+ case FUTEX_SPIN_LOCK:
+ case FUTEX_SPIN_UNLOCK:
if (!futex_cmpxchg_enabled)
return -ENOSYS;
}
@@ -2991,6 +3438,10 @@ long do_futex(u32 __user *uaddr, int op, u32 val, ktime_t *timeout,
uaddr2);
case FUTEX_CMP_REQUEUE_PI:
return futex_requeue(uaddr, flags, uaddr2, val, val2, &val3, 1);
+ case FUTEX_SPIN_LOCK:
+ return futex_spin_lock(uaddr, flags);
+ case FUTEX_SPIN_UNLOCK:
+ return futex_spin_unlock(uaddr, flags);
}
return -ENOSYS;
}
@@ -3072,7 +3523,9 @@ static int __init futex_init(void)
for (i = 0; i < futex_hashsize; i++) {
atomic_set(&futex_queues[i].waiters, 0);
plist_head_init(&futex_queues[i].chain);
+ INIT_LIST_HEAD(&futex_queues[i].oslist);
spin_lock_init(&futex_queues[i].lock);
+ futex_queues[i].qhcache = NULL;
}
return 0;
--
1.7.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 1/5] futex: add new exclusive lock & unlock command codes
2014-07-21 15:24 ` [RFC PATCH 1/5] futex: add new exclusive lock & unlock command codes Waiman Long
@ 2014-07-21 16:42 ` Thomas Gleixner
2014-07-22 18:22 ` Waiman Long
0 siblings, 1 reply; 51+ messages in thread
From: Thomas Gleixner @ 2014-07-21 16:42 UTC (permalink / raw)
To: Waiman Long
Cc: Ingo Molnar, Peter Zijlstra, Darren Hart, Davidlohr Bueso,
Heiko Carstens, LKML, linux-api, linux-doc, Jason Low,
Scott J Norton
On Mon, 21 Jul 2014, Waiman Long wrote:
> +#define FUTEX_TID(u) (pid_t)((u) & FUTEX_TID_MASK)
> +#define FUTEX_HAS_WAITERS(u) ((u) & FUTEX_WAITERS)
You love ugly macros, right?
> +/*
> + * futex_spin_trylock - attempt to take the lock
> + * Return: 1 if successful or an error happen
> + * 0 otherwise
> + *
> + * Side effect: The uval and ret will be updated.
> + */
> +static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
> + int *pret, u32 vpid)
> +{
> + u32 old;
> +
> + *pret = get_futex_value_locked(puval, uaddr);
> + if (*pret)
> + return 1;
> +
> + if (FUTEX_TID(*puval))
> + return 0; /* The mutex is not free */
> +
> + old = *puval;
> +
> + *pret = cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
> + if (*pret)
> + return 1;
> + if (*puval == old) {
> + /* Adjust uval to reflect current value */
> + *puval = vpid | old;
> + return 1;
> + }
> + return 0;
What's the point if all of this?
A simple cmpxchg_futex_value_locked() does all of this, just less ugly
and without all these extra indirections and totally uncomprehensible
conditionals.
> +}
> +
> +/*
> + * futex_spin_lock
> + */
> +static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> +{
So this lacks a timeout. If we provide this, then we need to have the
timeout supported as well.
> + struct futex_hash_bucket *hb;
> + struct futex_q_head *qh = NULL;
> + struct futex_q_node qnode;
> + union futex_key key;
> + bool gotlock;
> + int ret, cnt;
> + u32 uval, vpid, old;
> +
> + qnode.task = current;
> + vpid = task_pid_vnr(qnode.task);
> +
> + ret = get_futex_key(uaddr, flags & FLAGS_SHARED, &key, VERIFY_WRITE);
> + if (unlikely(ret))
Stop sprinkling the code with unlikelys
> + return ret;
> +
> + hb = hash_futex(&key);
> + spin_lock(&hb->lock);
> +
> + /*
> + * Locate the queue head for the given key
> + */
Brilliant comment. If you'd comment the stuff which really matters and
leave out the obvious, then your code might be readable some day.
> + qh = find_qhead(hb, &key);
> +
> + /*
> + * Check the futex value under the hash bucket lock as it might
> + * be changed.
> + */
What might have changed? You enter the function with uaddr, but no
uval. So what changed?
> + if (futex_spin_trylock(uaddr, &uval, &ret, vpid))
> + goto hbunlock_out;
> +
> + if (!qh) {
> + /*
> + * First waiter:
> + * Allocate a queue head structure & initialize it
> + */
> + qh = qhead_alloc_init(hb, &key);
> + if (unlikely(!qh)) {
> + ret = -ENOMEM;
> + goto hbunlock_out;
> + }
> + } else {
> + atomic_inc(&qh->lcnt);
> + }
> + spin_unlock(&hb->lock);
> +
> + /*
> + * Put the task into the wait queue and sleep
> + */
> + preempt_disable();
Why?
> + get_task_struct(qnode.task);
So you get a task reference on current? What the heck is this for?
> + spin_lock(&qh->wlock);
> + list_add_tail(&qnode.wnode, &qh->waitq);
> + __set_current_state(TASK_INTERRUPTIBLE);
> + spin_unlock(&qh->wlock);
> + gotlock = false;
> + for (;;) {
> + ret = get_user(uval, uaddr);
> + if (ret)
> + break;
So you let user space handle EFAULT?
> +dequeue:
> + __set_current_state(TASK_RUNNING);
> + /*
> + * Remove itself from the wait queue and go back to optimistic
> + * spinning if it hasn't got the lock yet.
> + */
> + put_task_struct(qnode.task);
> + spin_lock(&qh->wlock);
> + list_del(&qnode.wnode);
> +
> + /*
> + * Try to clear the waiter bit if the wait queue is empty
> + */
> + if (list_empty(&qh->waitq)) {
> + int retval = get_futex_value_locked(&uval, uaddr);
> +
> + if (!retval && FUTEX_HAS_WAITERS(uval)) {
> + old = uval;
> + uval &= ~FUTEX_WAITERS;
> + (void)cmpxchg_futex_value_locked(&uval, uaddr, old,
> + uval);
> + }
> + }
> + spin_unlock(&qh->wlock);
> + preempt_enable();
> +
> + cnt = atomic_dec_return(&qh->lcnt);
> + if (cnt == 0)
> + qhead_free(qh, hb);
> + /*
> + * Need to set the waiters bit there are still waiters
> + */
> + else if (!ret)
> + ret = put_user(vpid | FUTEX_WAITERS, uaddr);
WTF? You fiddle with the uaddr completely unprotected.
> +out:
> + put_futex_key(&key);
> + return ret;
> +
> +hbunlock_out:
> + spin_unlock(&hb->lock);
> + goto out;
> +}
> +
> +/*
> + * futex_spin_unlock
> + */
> +static noinline int futex_spin_unlock(u32 __user *uaddr, unsigned int flags)
> +{
> + struct futex_hash_bucket *hb;
> + struct futex_q_head *qh;
> + union futex_key key;
> + struct task_struct *wtask; /* Task to be woken */
> + int ret, lcnt;
> + u32 uval, old, vpid = task_pid_vnr(current);
> +
> + ret = get_user(uval, uaddr);
> + if (ret)
> + return ret;
> +
> + /*
> + * The unlocker may have cleared the TID value and another task may
> + * steal it. However, if its TID is still set, we need to clear
> + * it as well as the FUTEX_WAITERS bit.
No, that's complete and utter crap. The unlocker is current and it may
not have cleared anything.
Your design or the lack thereof is a complete disaster.
Sit down first and define the exact semantics of the new opcode. That
includes user and kernel space and the interaction with robust list,
which you happily ignored.
What are the semantics of uval? When can it be changed in kernel and
in user space? How do we deal with corruption of the user space value?
How does that new opcode provide robustness?
How are faults handled?
....
Before you can't provide a coherent description of that, nothing of
this is going to happen. And after that, it's definitely not going to
look like the hackery you delivered now.
Thanks,
tglx
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 1/5] futex: add new exclusive lock & unlock command codes
2014-07-21 16:42 ` Thomas Gleixner
@ 2014-07-22 18:22 ` Waiman Long
[not found] ` <53CEABD7.3030509-VXdhtT5mjnY@public.gmane.org>
0 siblings, 1 reply; 51+ messages in thread
From: Waiman Long @ 2014-07-22 18:22 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Ingo Molnar, Peter Zijlstra, Darren Hart, Davidlohr Bueso,
Heiko Carstens, LKML, linux-api, linux-doc, Jason Low,
Scott J Norton
On 07/21/2014 12:42 PM, Thomas Gleixner wrote:
> On Mon, 21 Jul 2014, Waiman Long wrote:
>
>> +#define FUTEX_TID(u) (pid_t)((u)& FUTEX_TID_MASK)
>> +#define FUTEX_HAS_WAITERS(u) ((u)& FUTEX_WAITERS)
> You love ugly macros, right?
>
Not really, I just have a tendency to overuse it sometimes. I could take
those macros out.
>> +/*
>> + * futex_spin_trylock - attempt to take the lock
>> + * Return: 1 if successful or an error happen
>> + * 0 otherwise
>> + *
>> + * Side effect: The uval and ret will be updated.
>> + */
>> +static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
>> + int *pret, u32 vpid)
>> +{
>> + u32 old;
>> +
>> + *pret = get_futex_value_locked(puval, uaddr);
>> + if (*pret)
>> + return 1;
>> +
>> + if (FUTEX_TID(*puval))
>> + return 0; /* The mutex is not free */
>> +
>> + old = *puval;
>> +
>> + *pret = cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
>> + if (*pret)
>> + return 1;
>> + if (*puval == old) {
>> + /* Adjust uval to reflect current value */
>> + *puval = vpid | old;
>> + return 1;
>> + }
>> + return 0;
> What's the point if all of this?
>
> A simple cmpxchg_futex_value_locked() does all of this, just less ugly
> and without all these extra indirections and totally uncomprehensible
> conditionals.
>
Yes, the trylock function is somewhat unwieldy. Will integrate it back
to the corresponding place. As a trylock, we usually do a read first to
make sure that it is ready before doing cmpxchg. Blindly doing a cmpxhg
unconditionally may hinder performance.
>> +}
>> +
>> +/*
>> + * futex_spin_lock
>> + */
>> +static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
>> +{
> So this lacks a timeout. If we provide this, then we need to have the
> timeout supported as well.
Yes, a timeout isn't supported yet. This is a RFC and I want to get a
sense of how important a timeout will be before I add it in. I could
certainly add that in if people think it is an important feature to have.
>> + struct futex_hash_bucket *hb;
>> + struct futex_q_head *qh = NULL;
>> + struct futex_q_node qnode;
>> + union futex_key key;
>> + bool gotlock;
>> + int ret, cnt;
>> + u32 uval, vpid, old;
>> +
>> + qnode.task = current;
>> + vpid = task_pid_vnr(qnode.task);
>> +
>> + ret = get_futex_key(uaddr, flags& FLAGS_SHARED,&key, VERIFY_WRITE);
>> + if (unlikely(ret))
> Stop sprinkling the code with unlikelys
Sure. Will remove those unlikely() calls.
>> + return ret;
>> +
>> + hb = hash_futex(&key);
>> + spin_lock(&hb->lock);
>> +
>> + /*
>> + * Locate the queue head for the given key
>> + */
> Brilliant comment. If you'd comment the stuff which really matters and
> leave out the obvious, then your code might be readable some day.
That comment was before I extracted the code out into a separate
function. Will remove it.
>> + qh = find_qhead(hb,&key);
>> +
>> + /*
>> + * Check the futex value under the hash bucket lock as it might
>> + * be changed.
>> + */
> What might have changed? You enter the function with uaddr, but no
> uval. So what changed?
If there is contention, the spin_lock() call may take a while. Unlike a
wait-wake futex, the only uval that will be of interest is when the TID
portion is 0. So we don't really need to pass in an uval. The uval is
not 0 when the lock function is called. However, the lock owner may have
released the lock by the time we check the futex value there before we
go into spinning or waiting.
>
>
>> + if (futex_spin_trylock(uaddr,&uval,&ret, vpid))
>> + goto hbunlock_out;
>> +
>> + if (!qh) {
>> + /*
>> + * First waiter:
>> + * Allocate a queue head structure& initialize it
>> + */
>> + qh = qhead_alloc_init(hb,&key);
>> + if (unlikely(!qh)) {
>> + ret = -ENOMEM;
>> + goto hbunlock_out;
>> + }
>> + } else {
>> + atomic_inc(&qh->lcnt);
>> + }
>> + spin_unlock(&hb->lock);
>> +
>> + /*
>> + * Put the task into the wait queue and sleep
>> + */
>> + preempt_disable();
> Why?
I just follow what has been done in the mutex code where preemption is
disabled even in the sleeping loop.
>
>> + get_task_struct(qnode.task);
> So you get a task reference on current? What the heck is this for?
Because the task is going to sleep and a queue node with the task
pointer is going to be enqueued into the wait queue.
>> + spin_lock(&qh->wlock);
>> + list_add_tail(&qnode.wnode,&qh->waitq);
>> + __set_current_state(TASK_INTERRUPTIBLE);
>> + spin_unlock(&qh->wlock);
>> + gotlock = false;
>> + for (;;) {
>> + ret = get_user(uval, uaddr);
>> + if (ret)
>> + break;
> So you let user space handle EFAULT?
This is a good question. Do you have any suggestion on how to better
handle error when get_user fails?
>
>> +dequeue:
>> + __set_current_state(TASK_RUNNING);
>> + /*
>> + * Remove itself from the wait queue and go back to optimistic
>> + * spinning if it hasn't got the lock yet.
>> + */
>> + put_task_struct(qnode.task);
>> + spin_lock(&qh->wlock);
>> + list_del(&qnode.wnode);
>> +
>> + /*
>> + * Try to clear the waiter bit if the wait queue is empty
>> + */
>> + if (list_empty(&qh->waitq)) {
>> + int retval = get_futex_value_locked(&uval, uaddr);
>> +
>> + if (!retval&& FUTEX_HAS_WAITERS(uval)) {
>> + old = uval;
>> + uval&= ~FUTEX_WAITERS;
>> + (void)cmpxchg_futex_value_locked(&uval, uaddr, old,
>> + uval);
>> + }
>> + }
>> + spin_unlock(&qh->wlock);
>> + preempt_enable();
>> +
>> + cnt = atomic_dec_return(&qh->lcnt);
>> + if (cnt == 0)
>> + qhead_free(qh, hb);
>> + /*
>> + * Need to set the waiters bit there are still waiters
>> + */
>> + else if (!ret)
>> + ret = put_user(vpid | FUTEX_WAITERS, uaddr);
> WTF? You fiddle with the uaddr completely unprotected.
The get_futex_key(...., VERIFY_WRITE) call has check to make sure that
the location is writeable and get_user() call has happened without
error. What additional protection do you think we need here?
>> +out:
>> + put_futex_key(&key);
>> + return ret;
>> +
>> +hbunlock_out:
>> + spin_unlock(&hb->lock);
>> + goto out;
>> +}
>> +
>> +/*
>> + * futex_spin_unlock
>> + */
>> +static noinline int futex_spin_unlock(u32 __user *uaddr, unsigned int flags)
>> +{
>> + struct futex_hash_bucket *hb;
>> + struct futex_q_head *qh;
>> + union futex_key key;
>> + struct task_struct *wtask; /* Task to be woken */
>> + int ret, lcnt;
>> + u32 uval, old, vpid = task_pid_vnr(current);
>> +
>> + ret = get_user(uval, uaddr);
>> + if (ret)
>> + return ret;
>> +
>> + /*
>> + * The unlocker may have cleared the TID value and another task may
>> + * steal it. However, if its TID is still set, we need to clear
>> + * it as well as the FUTEX_WAITERS bit.
> No, that's complete and utter crap. The unlocker is current and it may
> not have cleared anything.
>
> Your design or the lack thereof is a complete disaster.
In patch 5, the documentation and the sample unlock does clear the TID
before going in. The code here is just a safety measure in case the
unlocker doesn't follow the recommendation.
> Sit down first and define the exact semantics of the new opcode. That
> includes user and kernel space and the interaction with robust list,
> which you happily ignored.
>
> What are the semantics of uval? When can it be changed in kernel and
> in user space? How do we deal with corruption of the user space value?
The semantics of the uval is the same as that of PI and robust futex
where the TID portion contains the thread ID of the lock owner. It is my
intention to make it works with the robust futex mechanism before it can
be merged. This RPC patch series is for soliciting feedbacks and make
the necessary changes that make the patch acceptable before I go deep
into making it works with robust futex.
>
> How does that new opcode provide robustness?
>
> How are faults handled?
As you have a lot more experience working with futexes than me, any
suggestions on what kind of faults will happen and what are the best
practices to handle them will be highly appreciated.
-Longman
^ permalink raw reply [flat|nested] 51+ messages in thread
[parent not found: <1405956271-34339-1-git-send-email-Waiman.Long-VXdhtT5mjnY@public.gmane.org>]
* [RFC PATCH 2/5] futex: add optimistic spinning to FUTEX_SPIN_LOCK
[not found] ` <1405956271-34339-1-git-send-email-Waiman.Long-VXdhtT5mjnY@public.gmane.org>
@ 2014-07-21 15:24 ` Waiman Long
[not found] ` <1405956271-34339-3-git-send-email-Waiman.Long-VXdhtT5mjnY@public.gmane.org>
2014-07-21 20:17 ` Jason Low
0 siblings, 2 replies; 51+ messages in thread
From: Waiman Long @ 2014-07-21 15:24 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Heiko Carstens
Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-doc-u79uwXL29TY76Z2rM5mHXA, Jason Low, Scott J Norton,
Waiman Long
This patch adds code to do optimistic spinning for the FUTEX_SPIN_LOCK
primitive on the futex value when the lock owner is running. It is
the same optimistic spinning technique that is done in the mutex and
rw semaphore code to improve their performance especially on large
systems with large number of CPUs. When the lock owner is not running,
the spinning tasks will go to sleep.
There is 2 major advantages of doing optimistic spinning here:
1) It eliminates the context switching latency and overhead (at
least a few us) associated with sleeping and wakeup.
2) It eliminates most of the need to call futex(2) with
FUTEX_SPIN_UNLOCK as spinning is done without the need to set
the FUTEX_WAITERS bit.
Active spinning, however, does consume time in the current quantum of
time slice, make it a bit more likely to be preempted while running
in the critcal section due to time slice expiration. The heavy spinlock
contention of a wait-wake futex has the same effect. So it is not
specific
to this new primitive.
With the addition of optimistic spinning, it can significantly speed
up the amount of mutex operations that can be done in a certain unit
of time. With a userspace mutex microbenchmark running 10 million
mutex operations with 256 threads on a 4-socket 40-core server, the
spinning futex can achieve a rate of about 4.9 Mops/s with a critical
section load of 10 pause instructions. Whereas the wait-wake futex can
only achieve a rate of 3.7 Mops/s. When increasing the load to 100,
the corresponding rates become 2.8 Mops/s and 0.8 Mops/s respectively.
Signed-off-by: Waiman Long <Waiman.Long-VXdhtT5mjnY@public.gmane.org>
---
kernel/futex.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 172 insertions(+), 18 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index ec9b6ee..ddc24d1 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -71,6 +71,7 @@
#include <asm/futex.h>
#include "locking/rtmutex_common.h"
+#include "locking/mcs_spinlock.h"
/*
* READ this before attempting to hack on futexes!
@@ -2995,30 +2996,51 @@ void exit_robust_list(struct task_struct *curr)
#define FUTEX_TID(u) (pid_t)((u) & FUTEX_TID_MASK)
#define FUTEX_HAS_WAITERS(u) ((u) & FUTEX_WAITERS)
+/*
+ * Bit usage of the locker count:
+ * bit 0-23: number of lockers (spinners + waiters)
+ * bit 24-30: number of spinners
+ */
+#define FUTEX_SPINCNT_MAX 64 /* Maximum # of spinners */
+#define FUTEX_SPINCNT_SHIFT 24
+#define FUTEX_SPINCNT_BIAS (1U << FUTEX_SPINCNT_SHIFT)
+#define FUTEX_LOCKCNT_MASK (FUTEX_SPINCNT_BIAS - 1)
+#define FUTEX_LOCKCNT(qh) (atomic_read(&(qh)->lcnt) & FUTEX_LOCKCNT_MASK)
+#define FUTEX_SPINCNT(qh) (atomic_read(&(qh)->lcnt)>>FUTEX_SPINCNT_SHIFT)
+
/**
* struct futex_q_head - head of the optspin futex queue, one per unique key
* @hnode: list entry from the hash bucket
* @waitq: a list of waiting tasks
* @key: the key the futex is hashed on
+ * @osq: pointer to optimisitic spinning queue
+ * @owner: task_struct pointer of the futex owner
+ * @otid: TID of the futex owner
* @wlock: spinlock for managing wait queue
- * @lcnt: locker count
+ * @lcnt: locker count (spinners + waiters)
*
* Locking sequence
* ----------------
* 1) Lock hash bucket spinlock, locate the futex queue head
* 2) Inc lcnt (lock) or read lcnt (unlock), release hash bucket spinlock
- * 3) For waiter:
+ * 3) For spinner:
+ * - enqueue into the spinner queue and wait for its turn.
+ * 4) For waiter:
* - lock futex queue head spinlock
* - enqueue into the wait queue
* - release the lock & sleep
- * 4) For unlocker:
+ * 5) For unlocker:
* - locate the queue head just like a locker does
- * - Take the queue head lock and wake up the first waiter there.
+ * - clear the owner field if it is the current owner
+ * - if the locker count is not 0 & osq is empty, take the queue head lock
+ * and wake up the first waiter.
*/
struct futex_q_head {
struct list_head hnode;
struct list_head waitq;
union futex_key key;
+ struct optimistic_spin_queue *osq;
+ struct task_struct *owner;
pid_t otid;
spinlock_t wlock;
atomic_t lcnt;
@@ -3034,6 +3056,13 @@ struct futex_q_node {
struct task_struct *task;
};
+/*
+ * The maximum number of tasks that can be a futex spin queue
+ *
+ * It is set to the lesser of half of the total number of CPUs and
+ * FUTEX_SPINCNT_MAX to avoid locking up all the CPUs in spinning.
+ */
+static int __read_mostly futex_spincnt_max;
/*
* find_qhead - Find a queue head structure with the matching key
@@ -3061,7 +3090,7 @@ find_qhead(struct futex_hash_bucket *hb, union futex_key *key)
* contention with no hash bucket collision.
*/
static inline struct futex_q_head *
-qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key)
+qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key, u32 uval)
{
struct futex_q_head *qh = NULL;
static const struct futex_q_head qh0 = { { 0 } };
@@ -3073,10 +3102,16 @@ qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key)
/*
* Initialize the queue head structure
+ * The lock owner field may be NULL if the task has released the lock
+ * and exit.
*/
if (qh) {
- *qh = qh0;
- qh->key = *key;
+ *qh = qh0;
+ qh->key = *key;
+ qh->otid = FUTEX_TID(uval);
+ qh->owner = futex_find_get_task(qh->otid);
+ if (unlikely(!qh->owner))
+ qh->otid = 0;
atomic_set(&qh->lcnt, 1);
INIT_LIST_HEAD(&qh->waitq);
spin_lock_init(&qh->wlock);
@@ -3120,9 +3155,11 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
/*
* Free the queue head structure
*/
- BUG_ON(!list_empty(&qh->waitq));
+ BUG_ON(!list_empty(&qh->waitq) || qh->osq);
list_del(&qh->hnode);
spin_unlock(&hb->lock);
+ if (qh->owner)
+ put_task_struct(qh->owner);
if (!hb->qhcache && (cmpxchg(&hb->qhcache, NULL, qh) == NULL))
return;
@@ -3134,14 +3171,19 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
* Return: 1 if successful or an error happen
* 0 otherwise
*
+ * Optimistic spinning is done without holding lock, but with page fault
+ * explicitly disabled. So different functions need to be used to access
+ * the userspace futex value.
+ *
* Side effect: The uval and ret will be updated.
*/
static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
- int *pret, u32 vpid)
+ int *pret, u32 vpid, bool spinning)
{
- u32 old;
+ u32 old;
- *pret = get_futex_value_locked(puval, uaddr);
+ *pret = spinning ? __copy_from_user_inatomic(puval, uaddr, sizeof(u32))
+ : get_futex_value_locked(puval, uaddr);
if (*pret)
return 1;
@@ -3150,18 +3192,102 @@ static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
old = *puval;
- *pret = cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
+ *pret = spinning
+ ? futex_atomic_cmpxchg_inatomic(puval, uaddr, old, vpid)
+ : cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
+
if (*pret)
return 1;
if (*puval == old) {
/* Adjust uval to reflect current value */
- *puval = vpid | old;
+ *puval = spinning ? vpid : (vpid | old);
return 1;
}
return 0;
}
/*
+ * futex_optspin - optimistic spinning loop
+ * Return: 1 if lock successfully acquired
+ * 0 if need to fall back to waiting
+ *
+ * Page fault and preemption are disabled in the optimistic spinning
+ * loop. Preemption should have been disabled before calling this function.
+ *
+ * The number of spinners may temporarily exceed the threshold due to
+ * racing (the spin count check and add aren't atomic), but that shouldn't
+ * be a big problem.
+ */
+static inline int futex_optspin(struct futex_q_head *qh,
+ struct futex_q_node *qn,
+ u32 __user *uaddr,
+ u32 vpid)
+{
+ u32 uval;
+ int ret, gotlock = false;
+ struct task_struct *owner;
+
+ /*
+ * Increment the spinner count
+ */
+ atomic_add(FUTEX_SPINCNT_BIAS, &qh->lcnt);
+ if (!osq_lock(&qh->osq)) {
+ atomic_sub(FUTEX_SPINCNT_BIAS, &qh->lcnt);
+ return gotlock;
+ }
+ pagefault_disable();
+ for (;; cpu_relax()) {
+ if (futex_spin_trylock(uaddr, &uval, &ret, vpid, true)) {
+ /*
+ * Fall back to waiting if an error happen
+ */
+ if (ret)
+ break;
+ qh->otid = vpid;
+ owner = xchg(&qh->owner, qn->task);
+ get_task_struct(qn->task);
+ if (owner)
+ put_task_struct(owner);
+ gotlock = true;
+ break;
+ } else if (unlikely(FUTEX_HAS_WAITERS(uval))) {
+ /*
+ * Try to turn off the waiter bit as it now has a
+ * spinner. It doesn't matter if it fails as it will
+ * try again in the next iteration.
+ */
+ (void)futex_atomic_cmpxchg_inatomic
+ (&uval, uaddr, uval, uval & ~FUTEX_WAITERS);
+ }
+
+ if (unlikely(FUTEX_TID(uval) != qh->otid)) {
+ /*
+ * Owner has changed
+ */
+ qh->otid = FUTEX_TID(uval);
+ owner = xchg(&qh->owner, futex_find_get_task(qh->otid));
+ if (owner)
+ put_task_struct(owner);
+ }
+ owner = ACCESS_ONCE(qh->owner);
+ if ((owner && !ACCESS_ONCE(owner->on_cpu)) || need_resched())
+ break;
+ }
+ pagefault_enable();
+ osq_unlock(&qh->osq);
+ atomic_sub(FUTEX_SPINCNT_BIAS, &qh->lcnt);
+
+ /*
+ * If we fell out of the spin path because of need_resched(),
+ * reschedule now.
+ */
+ if (!gotlock && need_resched())
+ schedule_preempt_disabled();
+
+ return gotlock;
+}
+
+/*
* futex_spin_lock
*/
static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
@@ -3170,6 +3296,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
struct futex_q_head *qh = NULL;
struct futex_q_node qnode;
union futex_key key;
+ struct task_struct *owner;
bool gotlock;
int ret, cnt;
u32 uval, vpid, old;
@@ -3193,7 +3320,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
* Check the futex value under the hash bucket lock as it might
* be changed.
*/
- if (futex_spin_trylock(uaddr, &uval, &ret, vpid))
+ if (futex_spin_trylock(uaddr, &uval, &ret, vpid, false))
goto hbunlock_out;
if (!qh) {
@@ -3201,7 +3328,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
* First waiter:
* Allocate a queue head structure & initialize it
*/
- qh = qhead_alloc_init(hb, &key);
+ qh = qhead_alloc_init(hb, &key, uval);
if (unlikely(!qh)) {
ret = -ENOMEM;
goto hbunlock_out;
@@ -3212,9 +3339,18 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
spin_unlock(&hb->lock);
/*
- * Put the task into the wait queue and sleep
+ * Perform optimisitic spinning if the owner is running.
*/
preempt_disable();
+ owner = ACCESS_ONCE(qh->owner);
+ if ((FUTEX_SPINCNT(qh) < futex_spincnt_max) &&
+ (!owner || owner->on_cpu))
+ if (futex_optspin(qh, &qnode, uaddr, vpid))
+ goto penable_out;
+
+ /*
+ * Put the task into the wait queue and sleep
+ */
get_task_struct(qnode.task);
spin_lock(&qh->wlock);
list_add_tail(&qnode.wnode, &qh->waitq);
@@ -3238,6 +3374,11 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
goto dequeue;
} else if (uval == old) {
gotlock = true;
+ qh->otid = vpid;
+ owner = xchg(&qh->owner, qnode.task);
+ get_task_struct(qnode.task);
+ if (owner)
+ put_task_struct(owner);
goto dequeue;
}
continue;
@@ -3286,15 +3427,17 @@ dequeue:
}
}
spin_unlock(&qh->wlock);
+
+penable_out:
preempt_enable();
cnt = atomic_dec_return(&qh->lcnt);
if (cnt == 0)
qhead_free(qh, hb);
/*
- * Need to set the waiters bit there are still waiters
+ * Need to set the waiters bit there no spinner running
*/
- else if (!ret)
+ else if (!ret && ((cnt >> FUTEX_SPINCNT_SHIFT) == 0))
ret = put_user(vpid | FUTEX_WAITERS, uaddr);
out:
put_futex_key(&key);
@@ -3356,6 +3499,13 @@ static noinline int futex_spin_unlock(u32 __user *uaddr, unsigned int flags)
}
/*
+ * Clear the owner field
+ */
+ if ((qh->owner == current) &&
+ (cmpxchg(&qh->owner, current, NULL) == current))
+ put_task_struct(current);
+
+ /*
* The hash bucket lock is being hold while retrieving the task
* structure from the queue head to make sure that the qh structure
* won't go away under the hood.
@@ -3520,6 +3670,10 @@ static int __init futex_init(void)
futex_detect_cmpxchg();
+ futex_spincnt_max = num_possible_cpus()/2;
+ if (futex_spincnt_max > FUTEX_SPINCNT_MAX)
+ futex_spincnt_max = FUTEX_SPINCNT_MAX;
+
for (i = 0; i < futex_hashsize; i++) {
atomic_set(&futex_queues[i].waiters, 0);
plist_head_init(&futex_queues[i].chain);
--
1.7.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
[parent not found: <1405956271-34339-3-git-send-email-Waiman.Long-VXdhtT5mjnY@public.gmane.org>]
* Re: [RFC PATCH 2/5] futex: add optimistic spinning to FUTEX_SPIN_LOCK
[not found] ` <1405956271-34339-3-git-send-email-Waiman.Long-VXdhtT5mjnY@public.gmane.org>
@ 2014-07-21 17:15 ` Davidlohr Bueso
[not found] ` <1405962929.11927.19.camel-5JQ4ckphU/8SZAcGdq5asR6epYMZPwEe5NbjCUgZEJk@public.gmane.org>
0 siblings, 1 reply; 51+ messages in thread
From: Davidlohr Bueso @ 2014-07-21 17:15 UTC (permalink / raw)
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Heiko Carstens, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-doc-u79uwXL29TY76Z2rM5mHXA, Jason Low, Scott J Norton
On Mon, 2014-07-21 at 11:24 -0400, Waiman Long wrote:
> This patch adds code to do optimistic spinning for the FUTEX_SPIN_LOCK
> primitive on the futex value when the lock owner is running. It is
> the same optimistic spinning technique that is done in the mutex and
> rw semaphore code to improve their performance especially on large
> systems with large number of CPUs. When the lock owner is not running,
> the spinning tasks will go to sleep.
>
> There is 2 major advantages of doing optimistic spinning here:
> 1) It eliminates the context switching latency and overhead (at
> least a few us) associated with sleeping and wakeup.
> 2) It eliminates most of the need to call futex(2) with
> FUTEX_SPIN_UNLOCK as spinning is done without the need to set
> the FUTEX_WAITERS bit.
I think this belongs with Patch 1: optimistic spinning feature should be
in the same patch when you add the new futex commands.
> Active spinning, however, does consume time in the current quantum of
> time slice, make it a bit more likely to be preempted while running
> in the critcal section due to time slice expiration. The heavy spinlock
> contention of a wait-wake futex has the same effect. So it is not
> specific
> to this new primitive.
>
> With the addition of optimistic spinning, it can significantly speed
> up the amount of mutex operations that can be done in a certain unit
> of time. With a userspace mutex microbenchmark running 10 million
> mutex operations with 256 threads on a 4-socket 40-core server, the
> spinning futex can achieve a rate of about 4.9 Mops/s with a critical
> section load of 10 pause instructions. Whereas the wait-wake futex can
> only achieve a rate of 3.7 Mops/s. When increasing the load to 100,
> the corresponding rates become 2.8 Mops/s and 0.8 Mops/s respectively.
>
> Signed-off-by: Waiman Long <Waiman.Long-VXdhtT5mjnY@public.gmane.org>
> ---
> kernel/futex.c | 190 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 files changed, 172 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/futex.c b/kernel/futex.c
> index ec9b6ee..ddc24d1 100644
> --- a/kernel/futex.c
> +++ b/kernel/futex.c
> @@ -71,6 +71,7 @@
> #include <asm/futex.h>
>
> #include "locking/rtmutex_common.h"
> +#include "locking/mcs_spinlock.h"
>
> /*
> * READ this before attempting to hack on futexes!
> @@ -2995,30 +2996,51 @@ void exit_robust_list(struct task_struct *curr)
> #define FUTEX_TID(u) (pid_t)((u) & FUTEX_TID_MASK)
> #define FUTEX_HAS_WAITERS(u) ((u) & FUTEX_WAITERS)
>
> +/*
> + * Bit usage of the locker count:
> + * bit 0-23: number of lockers (spinners + waiters)
> + * bit 24-30: number of spinners
> + */
> +#define FUTEX_SPINCNT_MAX 64 /* Maximum # of spinners */
> +#define FUTEX_SPINCNT_SHIFT 24
> +#define FUTEX_SPINCNT_BIAS (1U << FUTEX_SPINCNT_SHIFT)
> +#define FUTEX_LOCKCNT_MASK (FUTEX_SPINCNT_BIAS - 1)
> +#define FUTEX_LOCKCNT(qh) (atomic_read(&(qh)->lcnt) & FUTEX_LOCKCNT_MASK)
> +#define FUTEX_SPINCNT(qh) (atomic_read(&(qh)->lcnt)>>FUTEX_SPINCNT_SHIFT)
Both FUTEX_LOCKCNT and FUTEX_SPINCNT should be static inline functions.
> /**
> * struct futex_q_head - head of the optspin futex queue, one per unique key
> * @hnode: list entry from the hash bucket
> * @waitq: a list of waiting tasks
> * @key: the key the futex is hashed on
> + * @osq: pointer to optimisitic spinning queue
> + * @owner: task_struct pointer of the futex owner
> + * @otid: TID of the futex owner
> * @wlock: spinlock for managing wait queue
> - * @lcnt: locker count
> + * @lcnt: locker count (spinners + waiters)
> *
> * Locking sequence
> * ----------------
> * 1) Lock hash bucket spinlock, locate the futex queue head
> * 2) Inc lcnt (lock) or read lcnt (unlock), release hash bucket spinlock
> - * 3) For waiter:
> + * 3) For spinner:
> + * - enqueue into the spinner queue and wait for its turn.
> + * 4) For waiter:
> * - lock futex queue head spinlock
> * - enqueue into the wait queue
> * - release the lock & sleep
> - * 4) For unlocker:
> + * 5) For unlocker:
> * - locate the queue head just like a locker does
> - * - Take the queue head lock and wake up the first waiter there.
> + * - clear the owner field if it is the current owner
> + * - if the locker count is not 0 & osq is empty, take the queue head lock
> + * and wake up the first waiter.
> */
> struct futex_q_head {
> struct list_head hnode;
> struct list_head waitq;
> union futex_key key;
> + struct optimistic_spin_queue *osq;
> + struct task_struct *owner;
> pid_t otid;
> spinlock_t wlock;
> atomic_t lcnt;
> @@ -3034,6 +3056,13 @@ struct futex_q_node {
> struct task_struct *task;
> };
>
> +/*
> + * The maximum number of tasks that can be a futex spin queue
> + *
> + * It is set to the lesser of half of the total number of CPUs and
> + * FUTEX_SPINCNT_MAX to avoid locking up all the CPUs in spinning.
> + */
> +static int __read_mostly futex_spincnt_max;
>
> /*
> * find_qhead - Find a queue head structure with the matching key
> @@ -3061,7 +3090,7 @@ find_qhead(struct futex_hash_bucket *hb, union futex_key *key)
> * contention with no hash bucket collision.
> */
> static inline struct futex_q_head *
> -qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key)
> +qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key, u32 uval)
> {
> struct futex_q_head *qh = NULL;
> static const struct futex_q_head qh0 = { { 0 } };
> @@ -3073,10 +3102,16 @@ qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key)
>
> /*
> * Initialize the queue head structure
> + * The lock owner field may be NULL if the task has released the lock
> + * and exit.
> */
> if (qh) {
> - *qh = qh0;
> - qh->key = *key;
> + *qh = qh0;
> + qh->key = *key;
> + qh->otid = FUTEX_TID(uval);
> + qh->owner = futex_find_get_task(qh->otid);
> + if (unlikely(!qh->owner))
> + qh->otid = 0;
> atomic_set(&qh->lcnt, 1);
> INIT_LIST_HEAD(&qh->waitq);
> spin_lock_init(&qh->wlock);
All this can be a single qh setup function.
> @@ -3120,9 +3155,11 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
> /*
> * Free the queue head structure
> */
> - BUG_ON(!list_empty(&qh->waitq));
> + BUG_ON(!list_empty(&qh->waitq) || qh->osq);
> list_del(&qh->hnode);
> spin_unlock(&hb->lock);
> + if (qh->owner)
> + put_task_struct(qh->owner);
>
> if (!hb->qhcache && (cmpxchg(&hb->qhcache, NULL, qh) == NULL))
> return;
> @@ -3134,14 +3171,19 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
> * Return: 1 if successful or an error happen
> * 0 otherwise
> *
> + * Optimistic spinning is done without holding lock, but with page fault
> + * explicitly disabled. So different functions need to be used to access
> + * the userspace futex value.
> + *
> * Side effect: The uval and ret will be updated.
> */
> static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
> - int *pret, u32 vpid)
> + int *pret, u32 vpid, bool spinning)
> {
> - u32 old;
> + u32 old;
>
> - *pret = get_futex_value_locked(puval, uaddr);
> + *pret = spinning ? __copy_from_user_inatomic(puval, uaddr, sizeof(u32))
> + : get_futex_value_locked(puval, uaddr);
> if (*pret)
> return 1;
>
> @@ -3150,18 +3192,102 @@ static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
>
> old = *puval;
>
> - *pret = cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
> + *pret = spinning
> + ? futex_atomic_cmpxchg_inatomic(puval, uaddr, old, vpid)
> + : cmpxchg_futex_value_locked(puval, uaddr, old, vpid | old);
> +
> if (*pret)
> return 1;
> if (*puval == old) {
> /* Adjust uval to reflect current value */
> - *puval = vpid | old;
> + *puval = spinning ? vpid : (vpid | old);
> return 1;
> }
> return 0;
> }
>
> /*
> + * futex_optspin - optimistic spinning loop
> + * Return: 1 if lock successfully acquired
> + * 0 if need to fall back to waiting
> + *
> + * Page fault and preemption are disabled in the optimistic spinning
> + * loop. Preemption should have been disabled before calling this function.
> + *
> + * The number of spinners may temporarily exceed the threshold due to
> + * racing (the spin count check and add aren't atomic), but that shouldn't
> + * be a big problem.
> + */
> +static inline int futex_optspin(struct futex_q_head *qh,
> + struct futex_q_node *qn,
> + u32 __user *uaddr,
> + u32 vpid)
> +{
> + u32 uval;
> + int ret, gotlock = false;
> + struct task_struct *owner;
> +
> + /*
> + * Increment the spinner count
> + */
> + atomic_add(FUTEX_SPINCNT_BIAS, &qh->lcnt);
> + if (!osq_lock(&qh->osq)) {
> + atomic_sub(FUTEX_SPINCNT_BIAS, &qh->lcnt);
> + return gotlock;
> + }
> + pagefault_disable();
How about a comment to why pf is disabled.
> + for (;; cpu_relax()) {
while(true) {
> + if (futex_spin_trylock(uaddr, &uval, &ret, vpid, true)) {
> + /*
> + * Fall back to waiting if an error happen
> + */
> + if (ret)
> + break;
> + qh->otid = vpid;
> + owner = xchg(&qh->owner, qn->task);
> + get_task_struct(qn->task);
> + if (owner)
> + put_task_struct(owner);
> + gotlock = true;
> + break;
> + } else if (unlikely(FUTEX_HAS_WAITERS(uval))) {
Branch predictions have a time and place, please do not use
likely/unlikely just for anything.
> + /*
> + * Try to turn off the waiter bit as it now has a
> + * spinner. It doesn't matter if it fails as it will
> + * try again in the next iteration.
> + */
> + (void)futex_atomic_cmpxchg_inatomic
> + (&uval, uaddr, uval, uval & ~FUTEX_WAITERS);
> + }
> +
> + if (unlikely(FUTEX_TID(uval) != qh->otid)) {
> + /*
> + * Owner has changed
> + */
> + qh->otid = FUTEX_TID(uval);
> + owner = xchg(&qh->owner, futex_find_get_task(qh->otid));
> + if (owner)
> + put_task_struct(owner);
> + }
> + owner = ACCESS_ONCE(qh->owner);
> + if ((owner && !ACCESS_ONCE(owner->on_cpu)) || need_resched())
> + break;
> + }
> + pagefault_enable();
> + osq_unlock(&qh->osq);
> + atomic_sub(FUTEX_SPINCNT_BIAS, &qh->lcnt);
> +
> + /*
> + * If we fell out of the spin path because of need_resched(),
> + * reschedule now.
> + */
> + if (!gotlock && need_resched())
> + schedule_preempt_disabled();
> +
> + return gotlock;
> +}
> +
> +/*
> * futex_spin_lock
> */
> static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> @@ -3170,6 +3296,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> struct futex_q_head *qh = NULL;
> struct futex_q_node qnode;
> union futex_key key;
> + struct task_struct *owner;
> bool gotlock;
> int ret, cnt;
> u32 uval, vpid, old;
> @@ -3193,7 +3320,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> * Check the futex value under the hash bucket lock as it might
> * be changed.
> */
> - if (futex_spin_trylock(uaddr, &uval, &ret, vpid))
> + if (futex_spin_trylock(uaddr, &uval, &ret, vpid, false))
> goto hbunlock_out;
>
> if (!qh) {
> @@ -3201,7 +3328,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> * First waiter:
> * Allocate a queue head structure & initialize it
> */
> - qh = qhead_alloc_init(hb, &key);
> + qh = qhead_alloc_init(hb, &key, uval);
> if (unlikely(!qh)) {
> ret = -ENOMEM;
> goto hbunlock_out;
> @@ -3212,9 +3339,18 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> spin_unlock(&hb->lock);
>
> /*
> - * Put the task into the wait queue and sleep
> + * Perform optimisitic spinning if the owner is running.
> */
> preempt_disable();
> + owner = ACCESS_ONCE(qh->owner);
> + if ((FUTEX_SPINCNT(qh) < futex_spincnt_max) &&
> + (!owner || owner->on_cpu))
> + if (futex_optspin(qh, &qnode, uaddr, vpid))
> + goto penable_out;
> +
> + /*
> + * Put the task into the wait queue and sleep
> + */
> get_task_struct(qnode.task);
> spin_lock(&qh->wlock);
> list_add_tail(&qnode.wnode, &qh->waitq);
> @@ -3238,6 +3374,11 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
> goto dequeue;
> } else if (uval == old) {
> gotlock = true;
> + qh->otid = vpid;
> + owner = xchg(&qh->owner, qnode.task);
> + get_task_struct(qnode.task);
> + if (owner)
> + put_task_struct(owner);
> goto dequeue;
> }
> continue;
> @@ -3286,15 +3427,17 @@ dequeue:
> }
> }
> spin_unlock(&qh->wlock);
> +
> +penable_out:
> preempt_enable();
>
> cnt = atomic_dec_return(&qh->lcnt);
> if (cnt == 0)
> qhead_free(qh, hb);
> /*
> - * Need to set the waiters bit there are still waiters
> + * Need to set the waiters bit there no spinner running
> */
> - else if (!ret)
> + else if (!ret && ((cnt >> FUTEX_SPINCNT_SHIFT) == 0))
> ret = put_user(vpid | FUTEX_WAITERS, uaddr);
> out:
> put_futex_key(&key);
> @@ -3356,6 +3499,13 @@ static noinline int futex_spin_unlock(u32 __user *uaddr, unsigned int flags)
> }
>
> /*
> + * Clear the owner field
> + */
> + if ((qh->owner == current) &&
> + (cmpxchg(&qh->owner, current, NULL) == current))
> + put_task_struct(current);
> +
> + /*
> * The hash bucket lock is being hold while retrieving the task
> * structure from the queue head to make sure that the qh structure
> * won't go away under the hood.
> @@ -3520,6 +3670,10 @@ static int __init futex_init(void)
>
> futex_detect_cmpxchg();
>
> + futex_spincnt_max = num_possible_cpus()/2;
> + if (futex_spincnt_max > FUTEX_SPINCNT_MAX)
> + futex_spincnt_max = FUTEX_SPINCNT_MAX;
This threshold needs commenting as well.
Thanks,
Davidlohr
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 2/5] futex: add optimistic spinning to FUTEX_SPIN_LOCK
2014-07-21 15:24 ` [RFC PATCH 2/5] futex: add optimistic spinning to FUTEX_SPIN_LOCK Waiman Long
[not found] ` <1405956271-34339-3-git-send-email-Waiman.Long-VXdhtT5mjnY@public.gmane.org>
@ 2014-07-21 20:17 ` Jason Low
2014-07-22 19:34 ` Waiman Long
1 sibling, 1 reply; 51+ messages in thread
From: Jason Low @ 2014-07-21 20:17 UTC (permalink / raw)
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Heiko Carstens, linux-kernel, linux-api,
linux-doc, Scott J Norton
On Mon, 2014-07-21 at 11:24 -0400, Waiman Long wrote:
> This patch adds code to do optimistic spinning for the FUTEX_SPIN_LOCK
> primitive on the futex value when the lock owner is running. It is
> the same optimistic spinning technique that is done in the mutex and
> rw semaphore code to improve their performance especially on large
> systems with large number of CPUs. When the lock owner is not running,
> the spinning tasks will go to sleep.
Perhaps we could introduce a "CONFIG_FUTEX_SPIN_ON_OWNER" that depends
on SMP and ARCH_SUPPORTS_ATOMIC_RMW?
> There is 2 major advantages of doing optimistic spinning here:
> 1) It eliminates the context switching latency and overhead (at
> least a few us) associated with sleeping and wakeup.
> 2) It eliminates most of the need to call futex(2) with
> FUTEX_SPIN_UNLOCK as spinning is done without the need to set
> the FUTEX_WAITERS bit.
> struct futex_q_head {
> struct list_head hnode;
> struct list_head waitq;
> union futex_key key;
> + struct optimistic_spin_queue *osq;
And this would have to be updated to
+ struct optimistic_spin_queue osq;
given the recent changes to the osq lock.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 2/5] futex: add optimistic spinning to FUTEX_SPIN_LOCK
2014-07-21 20:17 ` Jason Low
@ 2014-07-22 19:34 ` Waiman Long
0 siblings, 0 replies; 51+ messages in thread
From: Waiman Long @ 2014-07-22 19:34 UTC (permalink / raw)
To: Jason Low
Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Heiko Carstens, linux-kernel, linux-api,
linux-doc, Scott J Norton
On 07/21/2014 04:17 PM, Jason Low wrote:
> On Mon, 2014-07-21 at 11:24 -0400, Waiman Long wrote:
>> This patch adds code to do optimistic spinning for the FUTEX_SPIN_LOCK
>> primitive on the futex value when the lock owner is running. It is
>> the same optimistic spinning technique that is done in the mutex and
>> rw semaphore code to improve their performance especially on large
>> systems with large number of CPUs. When the lock owner is not running,
>> the spinning tasks will go to sleep.
> Perhaps we could introduce a "CONFIG_FUTEX_SPIN_ON_OWNER" that depends
> on SMP and ARCH_SUPPORTS_ATOMIC_RMW?
The new futex opcode depends on the ability to do cmpxchg() in the futex
context. The code will be disabled if futex cmpxchg is not supported. I
guess that should be enough to limit it to just a handful of architectures.
>> There is 2 major advantages of doing optimistic spinning here:
>> 1) It eliminates the context switching latency and overhead (at
>> least a few us) associated with sleeping and wakeup.
>> 2) It eliminates most of the need to call futex(2) with
>> FUTEX_SPIN_UNLOCK as spinning is done without the need to set
>> the FUTEX_WAITERS bit.
>> struct futex_q_head {
>> struct list_head hnode;
>> struct list_head waitq;
>> union futex_key key;
>> + struct optimistic_spin_queue *osq;
> And this would have to be updated to
>
> + struct optimistic_spin_queue osq;
>
> given the recent changes to the osq lock.
Yes, I will make the change in the next iteration of the patch.
-Longman
^ permalink raw reply [flat|nested] 51+ messages in thread
* [RFC PATCH 3/5] spinning futex: move a wakened task to spinning
2014-07-21 15:24 [RFC PATCH 0/5] futex: introduce an optimistic spinning futex Waiman Long
2014-07-21 15:24 ` [RFC PATCH 1/5] futex: add new exclusive lock & unlock command codes Waiman Long
[not found] ` <1405956271-34339-1-git-send-email-Waiman.Long-VXdhtT5mjnY@public.gmane.org>
@ 2014-07-21 15:24 ` Waiman Long
2014-07-21 15:24 ` [RFC PATCH 4/5] spinning futex: put waiting tasks in a sorted rbtree Waiman Long
` (3 subsequent siblings)
6 siblings, 0 replies; 51+ messages in thread
From: Waiman Long @ 2014-07-21 15:24 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Heiko Carstens
Cc: linux-kernel, linux-api, linux-doc, Jason Low, Scott J Norton,
Waiman Long
This patch moves a wakened task back to the optimistic spinning loop
if the owner is running.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
kernel/futex.c | 20 ++++++++++++++++----
1 files changed, 16 insertions(+), 4 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index ddc24d1..cca6457 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3215,8 +3215,8 @@ static inline int futex_spin_trylock(u32 __user *uaddr, u32 *puval,
* loop. Preemption should have been disabled before calling this function.
*
* The number of spinners may temporarily exceed the threshold due to
- * racing (the spin count check and add aren't atomic), but that shouldn't
- * be a big problem.
+ * racing and from waiter joining the OSQ(the spin count check and add
+ * aren't atomic), but that shouldn't be a real problem.
*/
static inline int futex_optspin(struct futex_q_head *qh,
struct futex_q_node *qn,
@@ -3297,7 +3297,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
struct futex_q_node qnode;
union futex_key key;
struct task_struct *owner;
- bool gotlock;
+ bool gotlock, slept;
int ret, cnt;
u32 uval, vpid, old;
@@ -3345,6 +3345,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
owner = ACCESS_ONCE(qh->owner);
if ((FUTEX_SPINCNT(qh) < futex_spincnt_max) &&
(!owner || owner->on_cpu))
+optspin:
if (futex_optspin(qh, &qnode, uaddr, vpid))
goto penable_out;
@@ -3356,7 +3357,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
list_add_tail(&qnode.wnode, &qh->waitq);
__set_current_state(TASK_INTERRUPTIBLE);
spin_unlock(&qh->wlock);
- gotlock = false;
+ slept = gotlock = false;
for (;;) {
ret = get_user(uval, uaddr);
if (ret)
@@ -3393,7 +3394,16 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
break;
}
+ /*
+ * Go back to spinning if the lock owner is running and the
+ * spinner limit hasn't been reached.
+ */
+ if (slept && (!owner || owner->on_cpu) &&
+ (FUTEX_SPINCNT(qh) < futex_spincnt_max))
+ break;
+
schedule_preempt_disabled();
+ slept = true;
/*
* Got a signal? Return EINTR
@@ -3427,6 +3437,8 @@ dequeue:
}
}
spin_unlock(&qh->wlock);
+ if (!ret && !gotlock)
+ goto optspin;
penable_out:
preempt_enable();
--
1.7.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [RFC PATCH 4/5] spinning futex: put waiting tasks in a sorted rbtree
2014-07-21 15:24 [RFC PATCH 0/5] futex: introduce an optimistic spinning futex Waiman Long
` (2 preceding siblings ...)
2014-07-21 15:24 ` [RFC PATCH 3/5] spinning futex: move a wakened task to spinning Waiman Long
@ 2014-07-21 15:24 ` Waiman Long
2014-07-21 15:24 ` [RFC PATCH 5/5] futex, doc: add a document on how to use the spinning futexes Waiman Long
` (2 subsequent siblings)
6 siblings, 0 replies; 51+ messages in thread
From: Waiman Long @ 2014-07-21 15:24 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Heiko Carstens
Cc: linux-kernel, linux-api, linux-doc, Jason Low, Scott J Norton,
Waiman Long
A simple FIFO list for the waiting tasks is easy to use. However, it
differs in behavior from the other futex primitives that the waiting
tasks are put in a priority list.
In order to make a sorted list ranked on priority as well as the
order they enter the kernel, we need to convert the simple list into
a rbtree.
This patch changes the waiting queue into an rbtree sorted first by
process priority followed by the order the tasks enter the kernel.
The optimistic spinning itself is strictly FIFO. There is no easy
way to make it priority based. In fact, the spinlock contention in
a wait-wake futex when the contending tasks enter the kernel also
serves as a kind of FIFO queue for processing the request. Priority
doesn't play a role there too.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
kernel/futex.c | 64 +++++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 52 insertions(+), 12 deletions(-)
diff --git a/kernel/futex.c b/kernel/futex.c
index cca6457..12d855c 100644
--- a/kernel/futex.c
+++ b/kernel/futex.c
@@ -3011,10 +3011,11 @@ void exit_robust_list(struct task_struct *curr)
/**
* struct futex_q_head - head of the optspin futex queue, one per unique key
* @hnode: list entry from the hash bucket
- * @waitq: a list of waiting tasks
+ * @waitq: a rbtree of waiting tasks sorted by priority & sequence number
* @key: the key the futex is hashed on
* @osq: pointer to optimisitic spinning queue
* @owner: task_struct pointer of the futex owner
+ * @seq: last used sequence number + 1
* @otid: TID of the futex owner
* @wlock: spinlock for managing wait queue
* @lcnt: locker count (spinners + waiters)
@@ -3027,7 +3028,7 @@ void exit_robust_list(struct task_struct *curr)
* - enqueue into the spinner queue and wait for its turn.
* 4) For waiter:
* - lock futex queue head spinlock
- * - enqueue into the wait queue
+ * - enqueue into the wait rbtree queue
* - release the lock & sleep
* 5) For unlocker:
* - locate the queue head just like a locker does
@@ -3037,10 +3038,11 @@ void exit_robust_list(struct task_struct *curr)
*/
struct futex_q_head {
struct list_head hnode;
- struct list_head waitq;
+ struct rb_root waitq;
union futex_key key;
struct optimistic_spin_queue *osq;
struct task_struct *owner;
+ unsigned long seq;
pid_t otid;
spinlock_t wlock;
atomic_t lcnt;
@@ -3050,10 +3052,14 @@ struct futex_q_head {
* struct futex_q_node - a node in the optspin futex queue
* @wnode: a list entry for the waiting queue
* @task: task_struct pointer of the current task
+ * @seq: unique sequence number that shows order of kernel entry
+ * @prio: process priority
*/
struct futex_q_node {
- struct list_head wnode;
+ struct rb_node wnode;
struct task_struct *task;
+ unsigned long seq;
+ int prio;
};
/*
@@ -3113,7 +3119,7 @@ qhead_alloc_init(struct futex_hash_bucket *hb, union futex_key *key, u32 uval)
if (unlikely(!qh->owner))
qh->otid = 0;
atomic_set(&qh->lcnt, 1);
- INIT_LIST_HEAD(&qh->waitq);
+ qh->waitq = RB_ROOT;
spin_lock_init(&qh->wlock);
list_add(&qh->hnode, &hb->oslist);
}
@@ -3155,7 +3161,7 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
/*
* Free the queue head structure
*/
- BUG_ON(!list_empty(&qh->waitq) || qh->osq);
+ BUG_ON(!RB_EMPTY_ROOT(&qh->waitq) || qh->osq);
list_del(&qh->hnode);
spin_unlock(&hb->lock);
if (qh->owner)
@@ -3167,6 +3173,38 @@ qhead_free(struct futex_q_head *qh, struct futex_hash_bucket *hb)
}
/*
+ * qnode_insert - insert queue node into an rbtree sorted by priority
+ * and then the unique sequence number
+ */
+static inline void qnode_insert(struct rb_root *root, struct futex_q_node *node)
+{
+ struct rb_node **new = &(root->rb_node), *parent = NULL;
+
+ /*
+ * Locate where to put the new node
+ */
+ while (*new) {
+ struct futex_q_node *this = container_of(*new,
+ struct futex_q_node, wnode);
+ parent = *new;
+ if (likely(node->prio == this->prio)) {
+ if (node->seq < this->seq)
+ new = &(parent->rb_left);
+ else /* node->seq > this->seq */
+ new = &(parent->rb_right);
+ } else if (node->prio < this->prio) {
+ new = &(parent->rb_left);
+ } else {
+ new = &(parent->rb_right);
+ }
+ }
+
+ /* Add new node and rebalance tree */
+ rb_link_node(&node->wnode, parent, new);
+ rb_insert_color(&node->wnode, root);
+}
+
+/*
* futex_spin_trylock - attempt to take the lock
* Return: 1 if successful or an error happen
* 0 otherwise
@@ -3336,6 +3374,7 @@ static noinline int futex_spin_lock(u32 __user *uaddr, unsigned int flags)
} else {
atomic_inc(&qh->lcnt);
}
+ qnode.seq = qh->seq++;
spin_unlock(&hb->lock);
/*
@@ -3353,8 +3392,9 @@ optspin:
* Put the task into the wait queue and sleep
*/
get_task_struct(qnode.task);
+ qnode.prio = min(qnode.task->normal_prio, MAX_RT_PRIO);
spin_lock(&qh->wlock);
- list_add_tail(&qnode.wnode, &qh->waitq);
+ qnode_insert(&qh->waitq, &qnode);
__set_current_state(TASK_INTERRUPTIBLE);
spin_unlock(&qh->wlock);
slept = gotlock = false;
@@ -3421,12 +3461,12 @@ dequeue:
*/
put_task_struct(qnode.task);
spin_lock(&qh->wlock);
- list_del(&qnode.wnode);
+ rb_erase(&qnode.wnode, &qh->waitq);
/*
* Try to clear the waiter bit if the wait queue is empty
*/
- if (list_empty(&qh->waitq)) {
+ if (RB_EMPTY_ROOT(&qh->waitq)) {
int retval = get_futex_value_locked(&uval, uaddr);
if (!retval && FUTEX_HAS_WAITERS(uval)) {
@@ -3529,9 +3569,9 @@ static noinline int futex_spin_unlock(u32 __user *uaddr, unsigned int flags)
lcnt = atomic_read(&qh->lcnt);
if (lcnt) {
spin_lock(&qh->wlock);
- if (!list_empty(&qh->waitq))
- wtask = list_entry(qh->waitq.next, struct futex_q_node,
- wnode)->task;
+ if (!RB_EMPTY_ROOT(&qh->waitq))
+ wtask = container_of(rb_first(&qh->waitq),
+ struct futex_q_node, wnode)->task;
spin_unlock(&qh->wlock);
}
spin_unlock(&hb->lock);
--
1.7.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* [RFC PATCH 5/5] futex, doc: add a document on how to use the spinning futexes
2014-07-21 15:24 [RFC PATCH 0/5] futex: introduce an optimistic spinning futex Waiman Long
` (3 preceding siblings ...)
2014-07-21 15:24 ` [RFC PATCH 4/5] spinning futex: put waiting tasks in a sorted rbtree Waiman Long
@ 2014-07-21 15:24 ` Waiman Long
2014-07-21 15:45 ` Randy Dunlap
2014-07-21 16:42 ` [RFC PATCH 0/5] futex: introduce an optimistic spinning futex Andi Kleen
2014-07-21 21:18 ` Ingo Molnar
6 siblings, 1 reply; 51+ messages in thread
From: Waiman Long @ 2014-07-21 15:24 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Heiko Carstens
Cc: linux-kernel, linux-api, linux-doc, Jason Low, Scott J Norton,
Waiman Long
This patch adds a new document file on how to use the spinning futexes.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
Documentation/spinning-futex.txt | 109 ++++++++++++++++++++++++++++++++++++++
1 files changed, 109 insertions(+), 0 deletions(-)
create mode 100644 Documentation/spinning-futex.txt
diff --git a/Documentation/spinning-futex.txt b/Documentation/spinning-futex.txt
new file mode 100644
index 0000000..e3cb5a2
--- /dev/null
+++ b/Documentation/spinning-futex.txt
@@ -0,0 +1,109 @@
+Started by: Waiman Long <waiman.long@hp.com>
+
+Spinning Futex
+--------------
+
+There are two main problems for a wait-wake futex (FUTEX_WAIT and
+FUTEX_WAKE) when used for creating user-space lock primitives:
+
+ 1) With a wait-wake futex, tasks waiting for a lock are put to sleep
+ in the futex queue to be woken up by the lock owner when it is done
+ with the lock. Waking up a sleeping task, however, introduces some
+ additional latency which can be large especially if the critical
+ section protected by the lock is relatively short. This may cause
+ a performance bottleneck on large systems with many CPUs running
+ applications that need a lot of inter-thread synchronization.
+
+ 2) The performance of the wait-wake futex is currently
+ spinlock-constrained. When many threads are contending for a
+ futex in a large system with many CPUs, it is not unusual to have
+ spinlock contention accounting for more than 90% of the total
+ CPU cycles consumed at various points in time.
+
+Spinning futex is a solution to both the wakeup latency and spinlock
+contention problems by optimistically spinning on a locked futex
+when the lock owner is running within the kernel until the lock is
+free. This is the same optimistic spinning mechanism used by the kernel
+mutex and rw semaphore implementations to improve performance. The
+optimistic spinning was done without taking any lock.
+
+Implementation
+--------------
+
+Like the PI and robust futexes, a lock acquirer has to atomically
+put its thread ID (TID) into the lower 30 bits of the 32-bit futex
+which should has an original value of 0. If it succeeds, it will be
+the owner of the futex. Otherwise, it has to call into the kernel
+using the new FUTEX_SPIN_LOCK futex(2) syscall.
+
+The kernel will use the setting of the most significant bit
+(FUTEX_WAITERS) in the futex value to indicate one or more waiters
+are sleeping and need to be woken up later on.
+
+When it is time to unlock, the lock owner has to atomically clear
+the TID portion of the futex value. If the FUTEX_WAITERS bit is set,
+it has to issue a FUTEX_SPIN_UNLOCK futex system call to wake up the
+sleeping task.
+
+A return value of 1 from the FUTEX_SPIN_UNLOCK futex(2) syscall
+indicates a task has been woken up. The syscall returns 0 if no
+sleeping task is found or spinners are present to take the lock.
+
+The error number returned by a FUTEX_SPIN_UNLOCK call on an empty
+futex can be used to decide if the spinning futex functionality is
+implemented in the kernel. If it is present, the returned error number
+should be ESRCH. Otherwise it will be ENOSYS.
+
+Currently, only the first and the second arguments (the futex address
+and the opcode) of the futex(2) syscall is used. All the other
+arguments must be set to 0 or NULL to avoid forward compatibility
+problem.
+
+The spinning futex requires the kernel to have support for the cmpxchg
+functionality. For architectures that don't support cmpxchg, spinning
+futex will not be supported as well.
+
+Usage Scenario
+--------------
+
+A spinning futex can be used as an exclusive lock to guard a critical
+section which are unlikely to go to sleep in the kernel. The spinners
+in a spinning futex, however, will fall back to sleep in a wait queue
+if the lock owner isn't running. Therefore, it can also be used when
+the critical section is long and prone to sleeping. However, it may
+not have the performance benefit when compared with a wait-wake futex
+in this case.
+
+Sample Code
+-----------
+
+The following are sample code to implement a simple lock and unlock
+function.
+
+__thread int tid; /* Thread ID */
+
+void mutex_lock(int *faddr)
+{
+ if (cmpxchg(faddr, 0, tid) == 0)
+ return;
+ for (;;)
+ if (futex(faddr, FUTEX_SPIN_LOCK, ...) == 0)
+ break;
+}
+
+void mutex_unlock(int *faddr)
+{
+ int old, fval;
+
+ if ((fval = cmpxchg(faddr, tid, 0)) == tid)
+ return;
+ /* Clear only the TID portion of the futex */
+ for (;;) {
+ old = fval;
+ fval = cmpxchg(faddr, old, old & ~FUTEX_TID_MASK);
+ if (fval == old)
+ break;
+ }
+ if (fval & FUTEX_WAITERS)
+ futex(faddr, FUTEX_SPIN_UNLOCK, ...);
+}
--
1.7.1
^ permalink raw reply related [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 5/5] futex, doc: add a document on how to use the spinning futexes
2014-07-21 15:24 ` [RFC PATCH 5/5] futex, doc: add a document on how to use the spinning futexes Waiman Long
@ 2014-07-21 15:45 ` Randy Dunlap
2014-07-22 3:19 ` Waiman Long
0 siblings, 1 reply; 51+ messages in thread
From: Randy Dunlap @ 2014-07-21 15:45 UTC (permalink / raw)
To: Waiman Long, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
Darren Hart, Davidlohr Bueso, Heiko Carstens
Cc: linux-kernel, linux-api, linux-doc, Jason Low, Scott J Norton
On 07/21/2014 08:24 AM, Waiman Long wrote:
> This patch adds a new document file on how to use the spinning futexes.
>
> Signed-off-by: Waiman Long <Waiman.Long@hp.com>
> ---
> Documentation/spinning-futex.txt | 109 ++++++++++++++++++++++++++++++++++++++
> 1 files changed, 109 insertions(+), 0 deletions(-)
> create mode 100644 Documentation/spinning-futex.txt
>
> diff --git a/Documentation/spinning-futex.txt b/Documentation/spinning-futex.txt
> new file mode 100644
> index 0000000..e3cb5a2
> --- /dev/null
> +++ b/Documentation/spinning-futex.txt
> @@ -0,0 +1,109 @@
> +Started by: Waiman Long <waiman.long@hp.com>
> +
> +Spinning Futex
> +--------------
> +
> +There are two main problems for a wait-wake futex (FUTEX_WAIT and
> +FUTEX_WAKE) when used for creating user-space lock primitives:
> +
> + 1) With a wait-wake futex, tasks waiting for a lock are put to sleep
> + in the futex queue to be woken up by the lock owner when it is done
> + with the lock. Waking up a sleeping task, however, introduces some
> + additional latency which can be large especially if the critical
> + section protected by the lock is relatively short. This may cause
> + a performance bottleneck on large systems with many CPUs running
> + applications that need a lot of inter-thread synchronization.
> +
> + 2) The performance of the wait-wake futex is currently
> + spinlock-constrained. When many threads are contending for a
> + futex in a large system with many CPUs, it is not unusual to have
> + spinlock contention accounting for more than 90% of the total
> + CPU cycles consumed at various points in time.
> +
> +Spinning futex is a solution to both the wakeup latency and spinlock
> +contention problems by optimistically spinning on a locked futex
> +when the lock owner is running within the kernel until the lock is
> +free. This is the same optimistic spinning mechanism used by the kernel
> +mutex and rw semaphore implementations to improve performance. The
> +optimistic spinning was done without taking any lock.
is done
> +
> +Implementation
> +--------------
> +
> +Like the PI and robust futexes, a lock acquirer has to atomically
> +put its thread ID (TID) into the lower 30 bits of the 32-bit futex
> +which should has an original value of 0. If it succeeds, it will be
have
> +the owner of the futex. Otherwise, it has to call into the kernel
> +using the new FUTEX_SPIN_LOCK futex(2) syscall.
> +
> +The kernel will use the setting of the most significant bit
> +(FUTEX_WAITERS) in the futex value to indicate one or more waiters
> +are sleeping and need to be woken up later on.
> +
> +When it is time to unlock, the lock owner has to atomically clear
> +the TID portion of the futex value. If the FUTEX_WAITERS bit is set,
> +it has to issue a FUTEX_SPIN_UNLOCK futex system call to wake up the
> +sleeping task.
> +
> +A return value of 1 from the FUTEX_SPIN_UNLOCK futex(2) syscall
> +indicates a task has been woken up. The syscall returns 0 if no
> +sleeping task is found or spinners are present to take the lock.
> +
> +The error number returned by a FUTEX_SPIN_UNLOCK call on an empty
> +futex can be used to decide if the spinning futex functionality is
> +implemented in the kernel. If it is present, the returned error number
> +should be ESRCH. Otherwise it will be ENOSYS.
> +
> +Currently, only the first and the second arguments (the futex address
> +and the opcode) of the futex(2) syscall is used. All the other
are used.
> +arguments must be set to 0 or NULL to avoid forward compatibility
> +problem.
> +
> +The spinning futex requires the kernel to have support for the cmpxchg
> +functionality. For architectures that don't support cmpxchg, spinning
> +futex will not be supported as well.
> +
> +Usage Scenario
> +--------------
> +
> +A spinning futex can be used as an exclusive lock to guard a critical
> +section which are unlikely to go to sleep in the kernel. The spinners
is
> +in a spinning futex, however, will fall back to sleep in a wait queue
> +if the lock owner isn't running. Therefore, it can also be used when
> +the critical section is long and prone to sleeping. However, it may
> +not have the performance benefit when compared with a wait-wake futex
> +in this case.
> +
> +Sample Code
> +-----------
> +
> +The following are sample code to implement a simple lock and unlock
is
> +function.
> +
> +__thread int tid; /* Thread ID */
> +
> +void mutex_lock(int *faddr)
> +{
> + if (cmpxchg(faddr, 0, tid) == 0)
> + return;
> + for (;;)
> + if (futex(faddr, FUTEX_SPIN_LOCK, ...) == 0)
> + break;
> +}
> +
> +void mutex_unlock(int *faddr)
> +{
> + int old, fval;
> +
> + if ((fval = cmpxchg(faddr, tid, 0)) == tid)
> + return;
> + /* Clear only the TID portion of the futex */
> + for (;;) {
> + old = fval;
> + fval = cmpxchg(faddr, old, old & ~FUTEX_TID_MASK);
> + if (fval == old)
> + break;
> + }
> + if (fval & FUTEX_WAITERS)
> + futex(faddr, FUTEX_SPIN_UNLOCK, ...);
> +}
>
--
~Randy
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 5/5] futex, doc: add a document on how to use the spinning futexes
2014-07-21 15:45 ` Randy Dunlap
@ 2014-07-22 3:19 ` Waiman Long
0 siblings, 0 replies; 51+ messages in thread
From: Waiman Long @ 2014-07-22 3:19 UTC (permalink / raw)
To: Randy Dunlap
Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Heiko Carstens, linux-kernel, linux-api,
linux-doc, Jason Low, Scott J Norton
On 07/21/2014 11:45 AM, Randy Dunlap wrote:
> On 07/21/2014 08:24 AM, Waiman Long wrote:
>> This patch adds a new document file on how to use the spinning futexes.
>>
>> Signed-off-by: Waiman Long<Waiman.Long@hp.com>
>> ---
>> Documentation/spinning-futex.txt | 109 ++++++++++++++++++++++++++++++++++++++
>> 1 files changed, 109 insertions(+), 0 deletions(-)
>> create mode 100644 Documentation/spinning-futex.txt
>>
>> diff --git a/Documentation/spinning-futex.txt b/Documentation/spinning-futex.txt
>> new file mode 100644
>> index 0000000..e3cb5a2
>> --- /dev/null
>> +++ b/Documentation/spinning-futex.txt
>> @@ -0,0 +1,109 @@
>> +Started by: Waiman Long<waiman.long@hp.com>
>> +
>> +Spinning Futex
>> +--------------
>> +
>> +There are two main problems for a wait-wake futex (FUTEX_WAIT and
>> +FUTEX_WAKE) when used for creating user-space lock primitives:
>> +
>> + 1) With a wait-wake futex, tasks waiting for a lock are put to sleep
>> + in the futex queue to be woken up by the lock owner when it is done
>> + with the lock. Waking up a sleeping task, however, introduces some
>> + additional latency which can be large especially if the critical
>> + section protected by the lock is relatively short. This may cause
>> + a performance bottleneck on large systems with many CPUs running
>> + applications that need a lot of inter-thread synchronization.
>> +
>> + 2) The performance of the wait-wake futex is currently
>> + spinlock-constrained. When many threads are contending for a
>> + futex in a large system with many CPUs, it is not unusual to have
>> + spinlock contention accounting for more than 90% of the total
>> + CPU cycles consumed at various points in time.
>> +
>> +Spinning futex is a solution to both the wakeup latency and spinlock
>> +contention problems by optimistically spinning on a locked futex
>> +when the lock owner is running within the kernel until the lock is
>> +free. This is the same optimistic spinning mechanism used by the kernel
>> +mutex and rw semaphore implementations to improve performance. The
>> +optimistic spinning was done without taking any lock.
> is done
>
>> +
>> +Implementation
>> +--------------
>> +
>> +Like the PI and robust futexes, a lock acquirer has to atomically
>> +put its thread ID (TID) into the lower 30 bits of the 32-bit futex
>> +which should has an original value of 0. If it succeeds, it will be
> have
>
>> +the owner of the futex. Otherwise, it has to call into the kernel
>> +using the new FUTEX_SPIN_LOCK futex(2) syscall.
>> +
>> +The kernel will use the setting of the most significant bit
>> +(FUTEX_WAITERS) in the futex value to indicate one or more waiters
>> +are sleeping and need to be woken up later on.
>> +
>> +When it is time to unlock, the lock owner has to atomically clear
>> +the TID portion of the futex value. If the FUTEX_WAITERS bit is set,
>> +it has to issue a FUTEX_SPIN_UNLOCK futex system call to wake up the
>> +sleeping task.
>> +
>> +A return value of 1 from the FUTEX_SPIN_UNLOCK futex(2) syscall
>> +indicates a task has been woken up. The syscall returns 0 if no
>> +sleeping task is found or spinners are present to take the lock.
>> +
>> +The error number returned by a FUTEX_SPIN_UNLOCK call on an empty
>> +futex can be used to decide if the spinning futex functionality is
>> +implemented in the kernel. If it is present, the returned error number
>> +should be ESRCH. Otherwise it will be ENOSYS.
>> +
>> +Currently, only the first and the second arguments (the futex address
>> +and the opcode) of the futex(2) syscall is used. All the other
> are used.
>
>> +arguments must be set to 0 or NULL to avoid forward compatibility
>> +problem.
>> +
>> +The spinning futex requires the kernel to have support for the cmpxchg
>> +functionality. For architectures that don't support cmpxchg, spinning
>> +futex will not be supported as well.
>> +
>> +Usage Scenario
>> +--------------
>> +
>> +A spinning futex can be used as an exclusive lock to guard a critical
>> +section which are unlikely to go to sleep in the kernel. The spinners
> is
>
>> +in a spinning futex, however, will fall back to sleep in a wait queue
>> +if the lock owner isn't running. Therefore, it can also be used when
>> +the critical section is long and prone to sleeping. However, it may
>> +not have the performance benefit when compared with a wait-wake futex
>> +in this case.
>> +
>> +Sample Code
>> +-----------
>> +
>> +The following are sample code to implement a simple lock and unlock
> is
>
>> +function.
>> +
>> +__thread int tid; /* Thread ID */
>> +
>> +void mutex_lock(int *faddr)
>> +{
>> + if (cmpxchg(faddr, 0, tid) == 0)
>> + return;
>> + for (;;)
>> + if (futex(faddr, FUTEX_SPIN_LOCK, ...) == 0)
>> + break;
>> +}
>> +
>> +void mutex_unlock(int *faddr)
>> +{
>> + int old, fval;
>> +
>> + if ((fval = cmpxchg(faddr, tid, 0)) == tid)
>> + return;
>> + /* Clear only the TID portion of the futex */
>> + for (;;) {
>> + old = fval;
>> + fval = cmpxchg(faddr, old, old& ~FUTEX_TID_MASK);
>> + if (fval == old)
>> + break;
>> + }
>> + if (fval& FUTEX_WAITERS)
>> + futex(faddr, FUTEX_SPIN_UNLOCK, ...);
>> +}
>>
>
Thank for the grammar corrections. Will apply those to the documents.
-Longman
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
2014-07-21 15:24 [RFC PATCH 0/5] futex: introduce an optimistic spinning futex Waiman Long
` (4 preceding siblings ...)
2014-07-21 15:24 ` [RFC PATCH 5/5] futex, doc: add a document on how to use the spinning futexes Waiman Long
@ 2014-07-21 16:42 ` Andi Kleen
2014-07-21 16:45 ` Andi Kleen
` (2 more replies)
2014-07-21 21:18 ` Ingo Molnar
6 siblings, 3 replies; 51+ messages in thread
From: Andi Kleen @ 2014-07-21 16:42 UTC (permalink / raw)
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Heiko Carstens, linux-kernel, linux-api,
linux-doc, Jason Low, Scott J Norton
Waiman Long <Waiman.Long@hp.com> writes:
> This patch series introduces two new futex command codes to support
> a new optimistic spinning futex for implementing an exclusive lock
> primitive that should perform better than the same primitive using
> the wait-wake futex in cases where the lock owner is actively working
> instead of waiting for I/O completion.
How would you distinguish those two cases automatically?
>
> This patch series improves futex performance on two different fronts:
> 1) Reducing the amount of the futex spinlock contention by using 2
> different spinlocks instead of just one for the wait-wake futex.
> 2) Eliminating the context switching overhead and latency due to the
> sleeping and the waking of the waiting tasks.
FWIW the main problem is currently that switch-through-idle is so
slow. I think improving that would give a boost to far more
situations.
> Patch 4 changes sleeping queue from a simple FIFO list to an rbtree
> sorted by process priority as well as the sequeunce the tasks enter
> the kernel.
This seems to mix new functionality with performance improvements?
Generally adding new ordering in an user interface is risky because
often user programs have been tuned for specific old ordering.
-Andi
--
ak@linux.intel.com -- Speaking for myself only
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
2014-07-21 16:42 ` [RFC PATCH 0/5] futex: introduce an optimistic spinning futex Andi Kleen
@ 2014-07-21 16:45 ` Andi Kleen
[not found] ` <871tte3bjw.fsf-KWJ+5VKanrL29G5dvP0v1laTQe2KTcn/@public.gmane.org>
` (3 more replies)
2014-07-22 18:28 ` Waiman Long
[not found] ` <8761iq3bp3.fsf-KWJ+5VKanrL29G5dvP0v1laTQe2KTcn/@public.gmane.org>
2 siblings, 4 replies; 51+ messages in thread
From: Andi Kleen @ 2014-07-21 16:45 UTC (permalink / raw)
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Heiko Carstens, linux-kernel, linux-api,
linux-doc, Jason Low, Scott J Norton
Andi Kleen <andi@firstfloor.org> writes:
> Waiman Long <Waiman.Long@hp.com> writes:
>
>> This patch series introduces two new futex command codes to support
>> a new optimistic spinning futex for implementing an exclusive lock
>> primitive that should perform better than the same primitive using
>> the wait-wake futex in cases where the lock owner is actively working
>> instead of waiting for I/O completion.
>
> How would you distinguish those two cases automatically?
Also BTW traditionally the spinning was just done in user space.
This would be always even more efficient, because it would
even avoid the syscall entry path.
Given the glibc adaptive mutexes are not very good, but I'm sure those
could be improved.
-Andi
--
ak@linux.intel.com -- Speaking for myself only
^ permalink raw reply [flat|nested] 51+ messages in thread
[parent not found: <871tte3bjw.fsf-KWJ+5VKanrL29G5dvP0v1laTQe2KTcn/@public.gmane.org>]
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
[not found] ` <871tte3bjw.fsf-KWJ+5VKanrL29G5dvP0v1laTQe2KTcn/@public.gmane.org>
@ 2014-07-21 17:20 ` Darren Hart
0 siblings, 0 replies; 51+ messages in thread
From: Darren Hart @ 2014-07-21 17:20 UTC (permalink / raw)
To: Andi Kleen, Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Davidlohr Bueso,
Heiko Carstens, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-doc-u79uwXL29TY76Z2rM5mHXA, Jason Low, Scott J Norton
On 7/21/14, 9:45, "Andi Kleen" <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org> wrote:
>Andi Kleen <andi-Vw/NltI1exuRpAAqCnN02g@public.gmane.org> writes:
>
>> Waiman Long <Waiman.Long-VXdhtT5mjnY@public.gmane.org> writes:
>>
>>> This patch series introduces two new futex command codes to support
>>> a new optimistic spinning futex for implementing an exclusive lock
>>> primitive that should perform better than the same primitive using
>>> the wait-wake futex in cases where the lock owner is actively working
>>> instead of waiting for I/O completion.
>>
>> How would you distinguish those two cases automatically?
>
>Also BTW traditionally the spinning was just done in user space.
>
>This would be always even more efficient, because it would
>even avoid the syscall entry path.
>
>Given the glibc adaptive mutexes are not very good, but I'm sure those
>could be improved.
I presented on something along these lines a few years back, and various
people have asked for the sample code over the years. Andi is right, doing
this from user-space has the potential to be more efficient, the challenge
is getting access to privileged information, such as the state of the
mutex owner. You don't want to spin if the owner is sleeping. I did this
by adding a tidrunning call to the vdso. My somewhat hacked solution is
still available here:
http://git.infradead.org/users/dvhart/linux.git/shortlog/refs/heads/futex/t
idrunning/dev
I abandoned the spinning in the kernel thing due to the overhead of the
system call if I remember correctly. Also available here:
http://git.infradead.org/users/dvhart/linux.git/shortlog/refs/heads/futex/f
utex-lock-adaptive
--
Darren Hart Open Source Technology Center
darren.hart-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org Intel Corporation
^ permalink raw reply [flat|nested] 51+ messages in thread
[parent not found: <CFF29A00.9D44A%dvhart@linux.intel.com>]
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
2014-07-21 16:45 ` Andi Kleen
[not found] ` <871tte3bjw.fsf-KWJ+5VKanrL29G5dvP0v1laTQe2KTcn/@public.gmane.org>
[not found] ` <CFF29A00.9D44A%dvhart@linux.intel.com>
@ 2014-07-21 18:24 ` Thomas Gleixner
2014-07-22 18:35 ` Waiman Long
3 siblings, 0 replies; 51+ messages in thread
From: Thomas Gleixner @ 2014-07-21 18:24 UTC (permalink / raw)
To: Andi Kleen
Cc: Waiman Long, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Heiko Carstens, linux-kernel, linux-api,
linux-doc, Jason Low, Scott J Norton
On Mon, 21 Jul 2014, Andi Kleen wrote:
> Andi Kleen <andi@firstfloor.org> writes:
>
> > Waiman Long <Waiman.Long@hp.com> writes:
> >
> >> This patch series introduces two new futex command codes to support
> >> a new optimistic spinning futex for implementing an exclusive lock
> >> primitive that should perform better than the same primitive using
> >> the wait-wake futex in cases where the lock owner is actively working
> >> instead of waiting for I/O completion.
> >
> > How would you distinguish those two cases automatically?
>
> Also BTW traditionally the spinning was just done in user space.
And traditionally that is based on heuristics, because user space
cannot figure out whether the lock owner is on the CPU or not.
And heuristics suck no matter what.
There have been attempts to expose that information to user space, but
that sucks as well, as it requires updates of that information from
the guts of the scheduler, does not scale at all and what's worse, it
has security implications.
Thanks,
tglx
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
2014-07-21 16:45 ` Andi Kleen
` (2 preceding siblings ...)
2014-07-21 18:24 ` Thomas Gleixner
@ 2014-07-22 18:35 ` Waiman Long
3 siblings, 0 replies; 51+ messages in thread
From: Waiman Long @ 2014-07-22 18:35 UTC (permalink / raw)
To: Andi Kleen
Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Heiko Carstens, linux-kernel, linux-api,
linux-doc, Jason Low, Scott J Norton
On 07/21/2014 12:45 PM, Andi Kleen wrote:
> Andi Kleen<andi@firstfloor.org> writes:
>
>> Waiman Long<Waiman.Long@hp.com> writes:
>>
>>> This patch series introduces two new futex command codes to support
>>> a new optimistic spinning futex for implementing an exclusive lock
>>> primitive that should perform better than the same primitive using
>>> the wait-wake futex in cases where the lock owner is actively working
>>> instead of waiting for I/O completion.
>> How would you distinguish those two cases automatically?
> Also BTW traditionally the spinning was just done in user space.
>
> This would be always even more efficient, because it would
> even avoid the syscall entry path.
Spinning directly in userspace is useful if the critical section is
really short and the chance of heavy contention is small. However, if
the lock can be heavily contended and the critical section is not short,
these are the conditions where a spinning futex proposed by this patch
series can help.
-Longman
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
2014-07-21 16:42 ` [RFC PATCH 0/5] futex: introduce an optimistic spinning futex Andi Kleen
2014-07-21 16:45 ` Andi Kleen
@ 2014-07-22 18:28 ` Waiman Long
[not found] ` <8761iq3bp3.fsf-KWJ+5VKanrL29G5dvP0v1laTQe2KTcn/@public.gmane.org>
2 siblings, 0 replies; 51+ messages in thread
From: Waiman Long @ 2014-07-22 18:28 UTC (permalink / raw)
To: Andi Kleen
Cc: Thomas Gleixner, Ingo Molnar, Peter Zijlstra, Darren Hart,
Davidlohr Bueso, Heiko Carstens, linux-kernel, linux-api,
linux-doc, Jason Low, Scott J Norton
On 07/21/2014 12:42 PM, Andi Kleen wrote:
> Waiman Long<Waiman.Long@hp.com> writes:
>
>> This patch series introduces two new futex command codes to support
>> a new optimistic spinning futex for implementing an exclusive lock
>> primitive that should perform better than the same primitive using
>> the wait-wake futex in cases where the lock owner is actively working
>> instead of waiting for I/O completion.
> How would you distinguish those two cases automatically?
I don't really distinguish between these two. The purpose of this
paragraph is to show the best use cases for the spinning futexes is when
the lock owner is not likely to sleep while in the critical section.
>> This patch series improves futex performance on two different fronts:
>> 1) Reducing the amount of the futex spinlock contention by using 2
>> different spinlocks instead of just one for the wait-wake futex.
>> 2) Eliminating the context switching overhead and latency due to the
>> sleeping and the waking of the waiting tasks.
> FWIW the main problem is currently that switch-through-idle is so
> slow. I think improving that would give a boost to far more
> situations.
If we can improve the context switching overhead, that can certainly
help a lot of applications.
>
>> Patch 4 changes sleeping queue from a simple FIFO list to an rbtree
>> sorted by process priority as well as the sequeunce the tasks enter
>> the kernel.
> This seems to mix new functionality with performance improvements?
>
> Generally adding new ordering in an user interface is risky because
> often user programs have been tuned for specific old ordering.
Not really, this patch is to emulate the current behavior of the
wait-wake futex which is queued in a priority queue. Because of
spinning, tasks may enter into sleep state in different order than when
they enter into the kernel. That is why I keep a sequence number to
track who go into the kernel first.
-Longman
^ permalink raw reply [flat|nested] 51+ messages in thread
[parent not found: <8761iq3bp3.fsf-KWJ+5VKanrL29G5dvP0v1laTQe2KTcn/@public.gmane.org>]
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
[not found] ` <8761iq3bp3.fsf-KWJ+5VKanrL29G5dvP0v1laTQe2KTcn/@public.gmane.org>
@ 2014-07-23 4:55 ` Mike Galbraith
2014-07-23 6:57 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Mike Galbraith @ 2014-07-23 4:55 UTC (permalink / raw)
To: Andi Kleen
Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, Peter Zijlstra,
Darren Hart, Davidlohr Bueso, Heiko Carstens,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-doc-u79uwXL29TY76Z2rM5mHXA, Jason Low, Scott J Norton
On Mon, 2014-07-21 at 09:42 -0700, Andi Kleen wrote:
> FWIW the main problem is currently that switch-through-idle is so
> slow. I think improving that would give a boost to far more
> situations.
Two high frequency idle enter/exit suckage spots:
1) nohz (tick) - it's expensive to start/stop tick on every micro-idle,
throttle it or something.
2) ondemand governor - tweak silly default settings to reflect the
reality that we routinely schedule communicating threads cross core.
(3. seek/destroy fastpath cycles, damn things multiply over time)
-Mike
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
2014-07-23 4:55 ` Mike Galbraith
@ 2014-07-23 6:57 ` Peter Zijlstra
2014-07-23 7:25 ` Mike Galbraith
0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2014-07-23 6:57 UTC (permalink / raw)
To: Mike Galbraith
Cc: Andi Kleen, Waiman Long, Thomas Gleixner, Ingo Molnar,
Darren Hart, Davidlohr Bueso, Heiko Carstens, linux-kernel,
linux-api, linux-doc, Jason Low, Scott J Norton
On Wed, Jul 23, 2014 at 06:55:03AM +0200, Mike Galbraith wrote:
> On Mon, 2014-07-21 at 09:42 -0700, Andi Kleen wrote:
>
> > FWIW the main problem is currently that switch-through-idle is so
> > slow. I think improving that would give a boost to far more
> > situations.
>
> Two high frequency idle enter/exit suckage spots:
>
> 1) nohz (tick) - it's expensive to start/stop tick on every micro-idle,
> throttle it or something.
Yeah, so the idea was to use the cpuidle idle guestimator to control
this, and now that we've moved it somewhat closer to the scheduler that
might become possible.
> 2) ondemand governor - tweak silly default settings to reflect the
> reality that we routinely schedule communicating threads cross core.
Yeah, so the plan is to shoot cpufreq in the head and base the
replacement on smp aware metrics ;-) Its on a todo list somewhere..
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
2014-07-23 6:57 ` Peter Zijlstra
@ 2014-07-23 7:25 ` Mike Galbraith
2014-07-23 7:35 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Mike Galbraith @ 2014-07-23 7:25 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andi Kleen, Waiman Long, Thomas Gleixner, Ingo Molnar,
Darren Hart, Davidlohr Bueso, Heiko Carstens,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-doc-u79uwXL29TY76Z2rM5mHXA, Jason Low, Scott J Norton
On Wed, 2014-07-23 at 08:57 +0200, Peter Zijlstra wrote:
> On Wed, Jul 23, 2014 at 06:55:03AM +0200, Mike Galbraith wrote:
> > On Mon, 2014-07-21 at 09:42 -0700, Andi Kleen wrote:
> >
> > > FWIW the main problem is currently that switch-through-idle is so
> > > slow. I think improving that would give a boost to far more
> > > situations.
> >
> > Two high frequency idle enter/exit suckage spots:
> >
> > 1) nohz (tick) - it's expensive to start/stop tick on every micro-idle,
> > throttle it or something.
>
> Yeah, so the idea was to use the cpuidle idle guestimator to control
> this, and now that we've moved it somewhat closer to the scheduler that
> might become possible.
>
> > 2) ondemand governor - tweak silly default settings to reflect the
> > reality that we routinely schedule communicating threads cross core.
>
> Yeah, so the plan is to shoot cpufreq in the head and base the
> replacement on smp aware metrics ;-) Its on a todo list somewhere..
It never ceases to amaze me that people aren't screaming bloody murder
about those two spots. Watching performance of lightly loaded boxen is
enough to make a grown man cry.
SUSE (and I in all of my many regression testing trees) puts tourniquets
on both of these blood spurting gashes, laptops be damned.
I also resurrect mwait_idle(), as while you may consider it obsolete, I
still love my lovely little Q6600 box (power sucking pig) dearly :)
-Mike
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
2014-07-23 7:25 ` Mike Galbraith
@ 2014-07-23 7:35 ` Peter Zijlstra
2014-07-23 7:39 ` Mike Galbraith
0 siblings, 1 reply; 51+ messages in thread
From: Peter Zijlstra @ 2014-07-23 7:35 UTC (permalink / raw)
To: Mike Galbraith
Cc: Andi Kleen, Waiman Long, Thomas Gleixner, Ingo Molnar,
Darren Hart, Davidlohr Bueso, Heiko Carstens, linux-kernel,
linux-api, linux-doc, Jason Low, Scott J Norton
On Wed, Jul 23, 2014 at 09:25:46AM +0200, Mike Galbraith wrote:
> I also resurrect mwait_idle(), as while you may consider it obsolete, I
> still love my lovely little Q6600 box (power sucking pig) dearly :)
Yeah, I keep forgetting about that one.. we should get that fixed.
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
2014-07-23 7:35 ` Peter Zijlstra
@ 2014-07-23 7:39 ` Mike Galbraith
2014-07-23 7:52 ` Peter Zijlstra
0 siblings, 1 reply; 51+ messages in thread
From: Mike Galbraith @ 2014-07-23 7:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Andi Kleen, Waiman Long, Thomas Gleixner, Ingo Molnar,
Darren Hart, Davidlohr Bueso, Heiko Carstens,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-doc-u79uwXL29TY76Z2rM5mHXA, Jason Low, Scott J Norton
On Wed, 2014-07-23 at 09:35 +0200, Peter Zijlstra wrote:
> On Wed, Jul 23, 2014 at 09:25:46AM +0200, Mike Galbraith wrote:
> > I also resurrect mwait_idle(), as while you may consider it obsolete, I
> > still love my lovely little Q6600 box (power sucking pig) dearly :)
>
> Yeah, I keep forgetting about that one.. we should get that fixed.
(it is an obsolete pig, and nobody else seems to mind, so I don't mind)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
2014-07-23 7:39 ` Mike Galbraith
@ 2014-07-23 7:52 ` Peter Zijlstra
0 siblings, 0 replies; 51+ messages in thread
From: Peter Zijlstra @ 2014-07-23 7:52 UTC (permalink / raw)
To: Mike Galbraith
Cc: Andi Kleen, Waiman Long, Thomas Gleixner, Ingo Molnar,
Darren Hart, Davidlohr Bueso, Heiko Carstens, linux-kernel,
linux-api, linux-doc, Jason Low, Scott J Norton
On Wed, Jul 23, 2014 at 09:39:03AM +0200, Mike Galbraith wrote:
> On Wed, 2014-07-23 at 09:35 +0200, Peter Zijlstra wrote:
> > On Wed, Jul 23, 2014 at 09:25:46AM +0200, Mike Galbraith wrote:
> > > I also resurrect mwait_idle(), as while you may consider it obsolete, I
> > > still love my lovely little Q6600 box (power sucking pig) dearly :)
> >
> > Yeah, I keep forgetting about that one.. we should get that fixed.
>
> (it is an obsolete pig, and nobody else seems to mind, so I don't mind)
Hey, I fixed early P4 topology setup yesterday, core2 is like brand
spanking new compared :-)
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
2014-07-21 15:24 [RFC PATCH 0/5] futex: introduce an optimistic spinning futex Waiman Long
` (5 preceding siblings ...)
2014-07-21 16:42 ` [RFC PATCH 0/5] futex: introduce an optimistic spinning futex Andi Kleen
@ 2014-07-21 21:18 ` Ingo Molnar
2014-07-21 21:41 ` Thomas Gleixner
[not found] ` <20140721211801.GA12149-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
6 siblings, 2 replies; 51+ messages in thread
From: Ingo Molnar @ 2014-07-21 21:18 UTC (permalink / raw)
To: Waiman Long
Cc: Thomas Gleixner, Peter Zijlstra, Darren Hart, Davidlohr Bueso,
Heiko Carstens, linux-kernel, linux-api, linux-doc, Jason Low,
Scott J Norton
* Waiman Long <Waiman.Long@hp.com> wrote:
> Testing done on a 4-socket Westmere-EX boxes with 40 cores (HT off)
> showed the following performance data (average kops/s) with various
> load factor (number of pause instructions) used in the critical
> section using an userspace mutex microbenchmark.
>
> Threads Load Waiting Futex Spinning Futex %Change
> ------- ---- ------------- -------------- -------
> 256 1 6894 8883 +29%
> 256 10 3656 4912 +34%
> 256 50 1332 4358 +227%
> 256 100 792 2753 +248%
> 10 1 6382 4838 -24%
> 10 10 3614 4748 +31%
> 10 50 1319 3900 +196%
> 10 100 782 2459 +214%
> 2 1 7905 7194 -9.0%
> 2 10 4556 4717 +3.5%
> 2 50 2191 4167 +90%
> 2 100 1767 2407 +36%
So the numbers look interesting - but it would be _really_ important
to provide noise/sttdev figures in a sixth column as well (denoted in
percentage units, not in benchmark units), so that we know how
significant a particular speedup (or slowdown) is.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 51+ messages in thread
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
2014-07-21 21:18 ` Ingo Molnar
@ 2014-07-21 21:41 ` Thomas Gleixner
[not found] ` <20140721211801.GA12149-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
1 sibling, 0 replies; 51+ messages in thread
From: Thomas Gleixner @ 2014-07-21 21:41 UTC (permalink / raw)
To: Ingo Molnar
Cc: Waiman Long, Peter Zijlstra, Darren Hart, Davidlohr Bueso,
Heiko Carstens, linux-kernel, linux-api, linux-doc, Jason Low,
Scott J Norton
On Mon, 21 Jul 2014, Ingo Molnar wrote:
> * Waiman Long <Waiman.Long@hp.com> wrote:
>
> > Testing done on a 4-socket Westmere-EX boxes with 40 cores (HT off)
> > showed the following performance data (average kops/s) with various
> > load factor (number of pause instructions) used in the critical
> > section using an userspace mutex microbenchmark.
> >
> > Threads Load Waiting Futex Spinning Futex %Change
> > ------- ---- ------------- -------------- -------
> > 256 1 6894 8883 +29%
> > 256 10 3656 4912 +34%
> > 256 50 1332 4358 +227%
> > 256 100 792 2753 +248%
> > 10 1 6382 4838 -24%
> > 10 10 3614 4748 +31%
> > 10 50 1319 3900 +196%
> > 10 100 782 2459 +214%
> > 2 1 7905 7194 -9.0%
> > 2 10 4556 4717 +3.5%
> > 2 50 2191 4167 +90%
> > 2 100 1767 2407 +36%
>
> So the numbers look interesting - but it would be _really_ important
> to provide noise/sttdev figures in a sixth column as well (denoted in
> percentage units, not in benchmark units), so that we know how
> significant a particular speedup (or slowdown) is.
We care about that, once we have something which
- Has a proper design
- Covers all the corner cases of futexes
- Does not introduce a gazillions of new lines of codes in futex.c
- Does not create another set of subtle security issues. I'm so not
interested to do the same exercise again - my brain still suffers.
The numbers provided are completely irrelevant as the implementation
is just the most idiotic approach to avoid all corner cases of
futexes, error handling and the proper treatment of detached kernel
state for the price of adding a completely unreviewable clusterfuck to
futex.c.
So, no. Don't encourage that number wankery any further. It's going
nowhere, period.
Thanks,
tglx
^ permalink raw reply [flat|nested] 51+ messages in thread
[parent not found: <20140721211801.GA12149-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>]
* Re: [RFC PATCH 0/5] futex: introduce an optimistic spinning futex
[not found] ` <20140721211801.GA12149-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
@ 2014-07-22 19:36 ` Waiman Long
0 siblings, 0 replies; 51+ messages in thread
From: Waiman Long @ 2014-07-22 19:36 UTC (permalink / raw)
To: Ingo Molnar
Cc: Thomas Gleixner, Peter Zijlstra, Darren Hart, Davidlohr Bueso,
Heiko Carstens, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
linux-api-u79uwXL29TY76Z2rM5mHXA,
linux-doc-u79uwXL29TY76Z2rM5mHXA, Jason Low, Scott J Norton
On 07/21/2014 05:18 PM, Ingo Molnar wrote:
> * Waiman Long<Waiman.Long-VXdhtT5mjnY@public.gmane.org> wrote:
>
>> Testing done on a 4-socket Westmere-EX boxes with 40 cores (HT off)
>> showed the following performance data (average kops/s) with various
>> load factor (number of pause instructions) used in the critical
>> section using an userspace mutex microbenchmark.
>>
>> Threads Load Waiting Futex Spinning Futex %Change
>> ------- ---- ------------- -------------- -------
>> 256 1 6894 8883 +29%
>> 256 10 3656 4912 +34%
>> 256 50 1332 4358 +227%
>> 256 100 792 2753 +248%
>> 10 1 6382 4838 -24%
>> 10 10 3614 4748 +31%
>> 10 50 1319 3900 +196%
>> 10 100 782 2459 +214%
>> 2 1 7905 7194 -9.0%
>> 2 10 4556 4717 +3.5%
>> 2 50 2191 4167 +90%
>> 2 100 1767 2407 +36%
> So the numbers look interesting - but it would be _really_ important
> to provide noise/sttdev figures in a sixth column as well (denoted in
> percentage units, not in benchmark units), so that we know how
> significant a particular speedup (or slowdown) is.
>
> Thanks,
>
> Ingo
The performance can varies quite a bit depending on what other processes
are running at the test execution time. I will include stddev data in
the next iteration of the patch.
-Longman
^ permalink raw reply [flat|nested] 51+ messages in thread