* [PATCH v7 0/5] locking/qspinlock: Enhance pvqspinlock performance
@ 2015-09-22 20:50 Waiman Long
2015-09-22 20:50 ` [PATCH v7 1/5] locking/qspinlock: relaxes cmpxchg & xchg ops in native code Waiman Long
` (4 more replies)
0 siblings, 5 replies; 22+ messages in thread
From: Waiman Long @ 2015-09-22 20:50 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
Waiman Long
v6->v7:
- Removed arch/x86/include/asm/qspinlock.h from patch 1.
- Removed the unconditional PV kick patch as it has been merged
into tip.
- Changed the pvstat_inc() API to add a new condition parameter.
- Added comments and rearranged code in patch 4 to clarify where
lock stealing happens.
- In patch 5, removed the check for pv_wait count when deciding when
to wait early.
- Updated copyrights and email address.
v5->v6:
- Added a new patch 1 to relax the cmpxchg and xchg operations in
the native code path to reduce performance overhead on non-x86
architectures.
- Updated the unconditional PV kick patch as suggested by PeterZ.
- Added a new patch to allow one lock stealing attempt at slowpath
entry point to reduce performance penalty due to lock waiter
preemption.
- Removed the pending bit and kick-ahead patches as they didn't show
any noticeable performance improvement on top of the lock stealing
patch.
- Simplified the adaptive spinning patch as the lock stealing patch
allows more aggressive pv_wait() without much performance penalty
in non-overcommitted VMs.
v4->v5:
- Rebased the patch to the latest tip tree.
- Corrected the comments and commit log for patch 1.
- Removed the v4 patch 5 as PV kick deferment is no longer needed with
the new tip tree.
- Simplified the adaptive spinning patch (patch 6) & improved its
performance a bit further.
- Re-ran the benchmark test with the new patch.
v3->v4:
- Patch 1: add comment about a possible race condition in PV unlock.
- Patch 2: simplified the pv_pending_lock() function as suggested by
Davidlohr.
- Move PV unlock optimization patch forward to patch 4 & rerun
performance test.
v2->v3:
- Moved the deferred kicking enablement patch forward & moved back
the kick-ahead patch to make the effect of kick-ahead more visible.
- Reworked patch 6 to make it more readable.
- Reverted back to using state as a tri-state variable instead of
adding an additional bi-state variable.
- Added performance data for different values of PV_KICK_AHEAD_MAX.
- Added a new patch to optimize PV unlock code path performance.
v1->v2:
- Take out the queued unfair lock patches
- Add a patch to simplify the PV unlock code
- Move pending bit and statistics collection patches to the front
- Keep vCPU kicking in pv_kick_node(), but defer it to unlock time
when appropriate.
- Change the wait-early patch to use adaptive spinning to better
balance the different effects on normal and over-committed guests.
- Add patch-to-patch performance changes in the patch commit logs.
This patchset tries to improve the performance of both regular and
over-committed VM guests. The adaptive spinning patch was inspired
by the "Do Virtual Machines Really Scale?" blog from Sanidhya Kashyap.
Patch 1 relaxes the memory order restriction of atomic operations by
using less restrictive _acquire and _release variants of cmpxchg()
and xchg(). This will reduce performance overhead when ported to other
non-x86 architectures.
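As a rough illustration of the kind of change involved (a sketch only; the
actual diff is in patch 1), the trylock fast path goes from a fully ordered
cmpxchg to an acquire-only one:

    /* before: full memory barrier implied by atomic_cmpxchg() */
    if (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0)
            return 1;

    /* after: acquire ordering is all that taking the lock needs here */
    if (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0)
            return 1;

On x86 both forms compile to the same locked instruction; on weakly ordered
architectures the acquire variant can avoid a heavyweight barrier.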
Patch 2 optimizes the PV unlock code path performance for x86-64
architecture.
Patch 3 allows the collection of various counts that are useful for
seeing what is happening in the system. They do add a bit of overhead
when enabled, slowing performance a tiny bit.
Patch 4 allows one lock stealing attempt at slowpath entry. This causes
a pretty big performance improvement for over-committed VM guests.
Patch 5 enables adaptive spinning in the queue nodes. This patch
leads to further performance improvement in over-committed guests,
though the gain is not as big as that of the previous patch.
Waiman Long (5):
locking/qspinlock: relaxes cmpxchg & xchg ops in native code
locking/pvqspinlock, x86: Optimize PV unlock code path
locking/pvqspinlock: Collect slowpath lock statistics
locking/pvqspinlock: Allow 1 lock stealing attempt
locking/pvqspinlock: Queue node adaptive spinning
arch/x86/Kconfig | 9 +
arch/x86/include/asm/qspinlock_paravirt.h | 59 +++++
include/asm-generic/qspinlock.h | 9 +-
kernel/locking/qspinlock.c | 48 +++--
kernel/locking/qspinlock_paravirt.h | 376 +++++++++++++++++++++++++----
5 files changed, 437 insertions(+), 64 deletions(-)
* [PATCH v7 1/5] locking/qspinlock: relaxes cmpxchg & xchg ops in native code
2015-09-22 20:50 [PATCH v7 0/5] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
@ 2015-09-22 20:50 ` Waiman Long
2015-10-13 18:02 ` Peter Zijlstra
2015-09-22 20:50 ` [PATCH v7 2/5] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
` (3 subsequent siblings)
4 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2015-09-22 20:50 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
Waiman Long
This patch replaces the cmpxchg() and xchg() calls in the native
qspinlock code with more relaxed versions of those calls to enable
other architectures to adopt queued spinlocks with less performance
overhead.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
include/asm-generic/qspinlock.h | 9 +++++----
kernel/locking/qspinlock.c | 24 +++++++++++++++++++-----
2 files changed, 24 insertions(+), 9 deletions(-)
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index e2aadbc..39e1cb2 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -12,8 +12,9 @@
* GNU General Public License for more details.
*
* (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
+ * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
*
- * Authors: Waiman Long <waiman.long@hp.com>
+ * Authors: Waiman Long <waiman.long@hpe.com>
*/
#ifndef __ASM_GENERIC_QSPINLOCK_H
#define __ASM_GENERIC_QSPINLOCK_H
@@ -62,7 +63,7 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
if (!atomic_read(&lock->val) &&
- (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
+ (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
return 1;
return 0;
}
@@ -77,7 +78,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
u32 val;
- val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
+ val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
if (likely(val == 0))
return;
queued_spin_lock_slowpath(lock, val);
@@ -93,7 +94,7 @@ static __always_inline void queued_spin_unlock(struct qspinlock *lock)
/*
* smp_mb__before_atomic() in order to guarantee release semantics
*/
- smp_mb__before_atomic_dec();
+ smp_mb__before_atomic();
atomic_sub(_Q_LOCKED_VAL, &lock->val);
}
#endif
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 87e9ce6..d2f3fda 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -14,8 +14,9 @@
* (C) Copyright 2013-2015 Hewlett-Packard Development Company, L.P.
* (C) Copyright 2013-2014 Red Hat, Inc.
* (C) Copyright 2015 Intel Corp.
+ * (C) Copyright 2015 Hewlett-Packard Enterprise Development LP
*
- * Authors: Waiman Long <waiman.long@hp.com>
+ * Authors: Waiman Long <waiman.long@hpe.com>
* Peter Zijlstra <peterz@infradead.org>
*/
@@ -176,7 +177,12 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
struct __qspinlock *l = (void *)lock;
- return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+ /*
+ * Use release semantics to make sure that the MCS node is properly
+ * initialized before changing the tail code.
+ */
+ return (u32)xchg_release(&l->tail,
+ tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
}
#else /* _Q_PENDING_BITS == 8 */
@@ -208,7 +214,11 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
for (;;) {
new = (val & _Q_LOCKED_PENDING_MASK) | tail;
- old = atomic_cmpxchg(&lock->val, val, new);
+ /*
+ * Use release semantics to make sure that the MCS node is
+ * properly initialized before changing the tail code.
+ */
+ old = atomic_cmpxchg_release(&lock->val, val, new);
if (old == val)
break;
@@ -319,7 +329,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
if (val == new)
new |= _Q_PENDING_VAL;
- old = atomic_cmpxchg(&lock->val, val, new);
+ old = atomic_cmpxchg_acquire(&lock->val, val, new);
if (old == val)
break;
@@ -426,7 +436,11 @@ queue:
set_locked(lock);
break;
}
- old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
+ /*
+ * The smp_load_acquire() call above has provided the necessary
+ * acquire semantics required for locking.
+ */
+ old = atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL);
if (old == val)
goto release; /* No contention */
--
1.7.1
* [PATCH v7 2/5] locking/pvqspinlock, x86: Optimize PV unlock code path
2015-09-22 20:50 [PATCH v7 0/5] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
2015-09-22 20:50 ` [PATCH v7 1/5] locking/qspinlock: relaxes cmpxchg & xchg ops in native code Waiman Long
@ 2015-09-22 20:50 ` Waiman Long
2015-09-22 20:50 ` [PATCH v7 3/5] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
` (2 subsequent siblings)
4 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2015-09-22 20:50 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
Waiman Long
The unlock function in queued spinlocks was optimized for better
performance on bare metal systems at the expense of virtualized guests.
For x86-64 systems, the unlock call needs to go through a
PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
registers before calling the real __pv_queued_spin_unlock()
function. The thunk code may also be in a separate cacheline from
__pv_queued_spin_unlock().
This patch optimizes the PV unlock code path by:
1) Moving the unlock slowpath code from the fastpath into a separate
__pv_queued_spin_unlock_slowpath() function to make the fastpath
as simple as possible.
2) For x86-64, hand-coding an assembly function to combine the register
saving thunk code with the fastpath code. Only registers that
are used in the fastpath will be saved and restored. If the
fastpath fails, the slowpath function will be called via another
PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
__pv_queued_spin_unlock() code as the thunk saves and restores
only one 32-bit register.
With a microbenchmark doing 5M lock-unlock loops, the table below shows
the execution times before and after the patch with different numbers
of threads in a VM running on a 32-core Westmere-EX box with x86-64
4.2-rc1 based kernels:
   Threads    Before patch    After patch    % Change
   -------    ------------    -----------    --------
      1          134.1 ms       119.3 ms       -11%
      2           1286 ms         953 ms       -26%
      3           3715 ms        3480 ms       -6.3%
      4           4092 ms        3764 ms       -8.0%
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
arch/x86/include/asm/qspinlock_paravirt.h | 59 +++++++++++++++++++++++++++++
kernel/locking/qspinlock_paravirt.h | 43 +++++++++++++--------
2 files changed, 86 insertions(+), 16 deletions(-)
diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
index b002e71..9f92c18 100644
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ b/arch/x86/include/asm/qspinlock_paravirt.h
@@ -1,6 +1,65 @@
#ifndef __ASM_QSPINLOCK_PARAVIRT_H
#define __ASM_QSPINLOCK_PARAVIRT_H
+/*
+ * For x86-64, PV_CALLEE_SAVE_REGS_THUNK() saves and restores 8 64-bit
+ * registers. For i386, however, only 1 32-bit register needs to be saved
+ * and restored. So an optimized version of __pv_queued_spin_unlock() is
+ * hand-coded for 64-bit, but it isn't worthwhile to do it for 32-bit.
+ */
+#ifdef CONFIG_64BIT
+
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
+#define __pv_queued_spin_unlock __pv_queued_spin_unlock
+#define PV_UNLOCK "__raw_callee_save___pv_queued_spin_unlock"
+#define PV_UNLOCK_SLOWPATH "__raw_callee_save___pv_queued_spin_unlock_slowpath"
+
+/*
+ * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
+ * which combines the register-saving thunk and the body of the following
+ * C code:
+ *
+ * void __pv_queued_spin_unlock(struct qspinlock *lock)
+ * {
+ * struct __qspinlock *l = (void *)lock;
+ * u8 lockval = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ *
+ * if (likely(lockval == _Q_LOCKED_VAL))
+ * return;
+ * pv_queued_spin_unlock_slowpath(lock, lockval);
+ * }
+ *
+ * For x86-64,
+ * rdi = lock (first argument)
+ * rsi = lockval (second argument)
+ * rdx = internal variable (set to 0)
+ */
+asm (".pushsection .text;"
+ ".globl " PV_UNLOCK ";"
+ ".align 4,0x90;"
+ PV_UNLOCK ": "
+ "push %rdx;"
+ "mov $0x1,%eax;"
+ "xor %edx,%edx;"
+ "lock cmpxchg %dl,(%rdi);"
+ "cmp $0x1,%al;"
+ "jne .slowpath;"
+ "pop %rdx;"
+ "ret;"
+ ".slowpath: "
+ "push %rsi;"
+ "movzbl %al,%esi;"
+ "call " PV_UNLOCK_SLOWPATH ";"
+ "pop %rsi;"
+ "pop %rdx;"
+ "ret;"
+ ".size " PV_UNLOCK ", .-" PV_UNLOCK ";"
+ ".popsection");
+
+#else /* CONFIG_64BIT */
+
+extern void __pv_queued_spin_unlock(struct qspinlock *lock);
PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock);
+#endif /* CONFIG_64BIT */
#endif
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index f0450ff..4bd323d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -308,23 +308,14 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
}
/*
- * PV version of the unlock function to be used in stead of
- * queued_spin_unlock().
+ * PV versions of the unlock fastpath and slowpath functions to be used
+ * instead of queued_spin_unlock().
*/
-__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+__visible void
+__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
{
struct __qspinlock *l = (void *)lock;
struct pv_node *node;
- u8 locked;
-
- /*
- * We must not unlock if SLOW, because in that case we must first
- * unhash. Otherwise it would be possible to have multiple @lock
- * entries, which would be BAD.
- */
- locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
- if (likely(locked == _Q_LOCKED_VAL))
- return;
if (unlikely(locked != _Q_SLOW_VAL)) {
WARN(!debug_locks_silent,
@@ -363,12 +354,32 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
*/
pv_kick(node->cpu);
}
+
/*
* Include the architecture specific callee-save thunk of the
* __pv_queued_spin_unlock(). This thunk is put together with
- * __pv_queued_spin_unlock() near the top of the file to make sure
- * that the callee-save thunk and the real unlock function are close
- * to each other sharing consecutive instruction cachelines.
+ * __pv_queued_spin_unlock() to make the callee-save thunk and the real unlock
+ * function close to each other sharing consecutive instruction cachelines.
+ * Alternatively, architecture specific version of __pv_queued_spin_unlock()
+ * can be defined.
*/
#include <asm/qspinlock_paravirt.h>
+#ifndef __pv_queued_spin_unlock
+__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+ u8 locked;
+
+ /*
+ * We must not unlock if SLOW, because in that case we must first
+ * unhash. Otherwise it would be possible to have multiple @lock
+ * entries, which would be BAD.
+ */
+ locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
+ if (likely(locked == _Q_LOCKED_VAL))
+ return;
+
+ __pv_queued_spin_unlock_slowpath(lock, locked);
+}
+#endif /* __pv_queued_spin_unlock */
--
1.7.1
* [PATCH v7 3/5] locking/pvqspinlock: Collect slowpath lock statistics
2015-09-22 20:50 [PATCH v7 0/5] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
2015-09-22 20:50 ` [PATCH v7 1/5] locking/qspinlock: relaxes cmpxchg & xchg ops in native code Waiman Long
2015-09-22 20:50 ` [PATCH v7 2/5] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
@ 2015-09-22 20:50 ` Waiman Long
2015-10-13 20:05 ` Peter Zijlstra
2015-09-22 20:50 ` [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
2015-09-22 20:50 ` [PATCH v7 5/5] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
4 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2015-09-22 20:50 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
Waiman Long
This patch enables the accumulation of kicking and waiting related
PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
option is selected. It also enables the collection of data that allow
us to calculate the kicking and wakeup latencies, which depend heavily
on the CPUs being used.
Debugfs is used to manage the statistics counts because of its simplicity.
It has the limitation that per-cpu statistics counters cannot be used, so
atomic instructions are needed to update the counters. This introduces a
certain amount of performance overhead when the feature is enabled.
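Once a kernel built with this option boots, the counters show up under the
pv-qspinlock directory in debugfs (typically mounted at /sys/kernel/debug),
and writing a non-zero value to the reset_cnts file there causes all the
counts and latency sums to be cleared.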
The measured latencies for different CPUs are:
   CPU            Wakeup    Kicking
   ---            ------    -------
   Haswell-EX     63.6us     7.4us
   Westmere-EX    67.6us     9.3us
The measured latencies varied a bit from run to run. The wakeup
latency is much higher than the kicking latency.
A sample of statistics counts after system bootup (with vCPU
overcommit) was:
hash_hops_count=9001
kick_latencies=138047878
kick_unlock_count=9001
kick_wait_count=9000
spurious_wakeup=3
wait_again_count=2
wait_head_count=10
wait_node_count=8994
wake_latencies=713195944
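Dividing the sampled latency sums by the matching counts gives averages of
roughly 138047878/9001 ~ 15.3us per kick and 713195944/9000 ~ 79.2us per
wakeup for this overcommitted run.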
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
arch/x86/Kconfig | 9 ++
kernel/locking/qspinlock_paravirt.h | 168 +++++++++++++++++++++++++++++++++-
2 files changed, 172 insertions(+), 5 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 01965fa..0842df7 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -721,6 +721,15 @@ config PARAVIRT_SPINLOCKS
If you are unsure how to answer this question, answer Y.
+config QUEUED_LOCK_STAT
+ bool "Paravirt queued lock statistics"
+ depends on PARAVIRT && DEBUG_FS && QUEUED_SPINLOCKS
+ ---help---
+ Enable the collection of statistical data on the behavior of
+ paravirtualized queued spinlocks and report them on debugfs.
+ There will be a slight performance overhead when this option
+ is enabled.
+
source "arch/x86/xen/Kconfig"
config KVM_GUEST
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 4bd323d..ab823b6 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -41,6 +41,148 @@ struct pv_node {
};
/*
+ * PV qspinlock statistics
+ */
+enum pv_qlock_stat {
+ pvstat_wait_head,
+ pvstat_wait_node,
+ pvstat_wait_again,
+ pvstat_kick_wait,
+ pvstat_kick_unlock,
+ pvstat_spurious,
+ pvstat_hops,
+ pvstat_num /* Total number of statistics counts */
+};
+
+#ifdef CONFIG_QUEUED_LOCK_STAT
+/*
+ * Collect pvqspinlock statistics
+ */
+#include <linux/debugfs.h>
+#include <linux/sched.h>
+
+static const char * const stat_fsnames[pvstat_num] = {
+ [pvstat_wait_head] = "wait_head_count",
+ [pvstat_wait_node] = "wait_node_count",
+ [pvstat_wait_again] = "wait_again_count",
+ [pvstat_kick_wait] = "kick_wait_count",
+ [pvstat_kick_unlock] = "kick_unlock_count",
+ [pvstat_spurious] = "spurious_wakeup",
+ [pvstat_hops] = "hash_hops_count",
+};
+
+static atomic_t pvstats[pvstat_num];
+
+/*
+ * pv_kick_latencies = sum of all pv_kick latencies in ns
+ * pv_wake_latencies = sum of all wakeup latencies in ns
+ *
+ * Avg kick latency = pv_kick_latencies/kick_unlock_count
+ * Avg wake latency = pv_wake_latencies/kick_wait_count
+ * Avg # of hops/hash = hash_hops_count/kick_unlock_count
+ */
+static atomic64_t pv_kick_latencies, pv_wake_latencies;
+static DEFINE_PER_CPU(u64, pv_kick_time);
+
+/*
+ * Reset all the statistics counts if set
+ */
+static bool reset_cnts __read_mostly;
+
+/*
+ * Initialize debugfs for the PV qspinlock statistics
+ */
+static int __init pv_qspinlock_debugfs(void)
+{
+ struct dentry *d_pvqlock = debugfs_create_dir("pv-qspinlock", NULL);
+ int i;
+
+ if (!d_pvqlock)
+ pr_warn("Could not create 'pv-qspinlock' debugfs directory\n");
+
+ for (i = 0; i < pvstat_num; i++)
+ debugfs_create_u32(stat_fsnames[i], 0444, d_pvqlock,
+ (u32 *)&pvstats[i]);
+ debugfs_create_u64("kick_latencies", 0444, d_pvqlock,
+ (u64 *)&pv_kick_latencies);
+ debugfs_create_u64("wake_latencies", 0444, d_pvqlock,
+ (u64 *)&pv_wake_latencies);
+ debugfs_create_bool("reset_cnts", 0644, d_pvqlock, (u32 *)&reset_cnts);
+ return 0;
+}
+fs_initcall(pv_qspinlock_debugfs);
+
+/*
+ * Reset all the counts
+ */
+static noinline void pvstat_reset(void)
+{
+ int i;
+
+ for (i = 0; i < pvstat_num; i++)
+ atomic_set(&pvstats[i], 0);
+ atomic64_set(&pv_kick_latencies, 0);
+ atomic64_set(&pv_wake_latencies, 0);
+ reset_cnts = 0;
+}
+
+/*
+ * Increment the PV qspinlock statistics counts
+ */
+static inline void pvstat_inc(enum pv_qlock_stat stat, bool cond)
+{
+ if (cond)
+ atomic_inc(&pvstats[stat]);
+ if (unlikely(reset_cnts))
+ pvstat_reset();
+}
+
+/*
+ * PV hash hop count
+ */
+static inline void pvstat_hop(int hopcnt)
+{
+ atomic_add(hopcnt, &pvstats[pvstat_hops]);
+}
+
+/*
+ * Replacement function for pv_kick()
+ */
+static inline void __pv_kick(int cpu)
+{
+ u64 start = sched_clock();
+
+ *per_cpu_ptr(&pv_kick_time, cpu) = start;
+ pv_kick(cpu);
+ atomic64_add(sched_clock() - start, &pv_kick_latencies);
+}
+
+/*
+ * Replacement function for pv_wait()
+ */
+static inline void __pv_wait(u8 *ptr, u8 val)
+{
+ u64 *pkick_time = this_cpu_ptr(&pv_kick_time);
+
+ *pkick_time = 0;
+ pv_wait(ptr, val);
+ if (*pkick_time) {
+ atomic64_add(sched_clock() - *pkick_time, &pv_wake_latencies);
+ pvstat_inc(pvstat_kick_wait, true);
+ }
+}
+
+#define pv_kick(c) __pv_kick(c)
+#define pv_wait(p, v) __pv_wait(p, v)
+
+#else /* CONFIG_QUEUED_LOCK_STAT */
+
+static inline void pvstat_inc(enum pv_qlock_stat stat, bool cond) { }
+static inline void pvstat_hop(int hopcnt) { }
+
+#endif /* CONFIG_QUEUED_LOCK_STAT */
+
+/*
* Lock and MCS node addresses hash table for fast lookup
*
* Hashing is done on a per-cacheline basis to minimize the need to access
@@ -100,10 +242,13 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
{
unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
struct pv_hash_entry *he;
+ int hopcnt = 0;
for_each_hash_entry(he, offset, hash) {
+ hopcnt++;
if (!cmpxchg(&he->lock, NULL, lock)) {
WRITE_ONCE(he->node, node);
+ pvstat_hop(hopcnt);
return &he->lock;
}
}
@@ -164,9 +309,10 @@ static void pv_init_node(struct mcs_spinlock *node)
static void pv_wait_node(struct mcs_spinlock *node)
{
struct pv_node *pn = (struct pv_node *)node;
+ int waitcnt = 0;
int loop;
- for (;;) {
+ for (;; waitcnt++) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (READ_ONCE(node->locked))
return;
@@ -184,12 +330,16 @@ static void pv_wait_node(struct mcs_spinlock *node)
*/
smp_store_mb(pn->state, vcpu_halted);
- if (!READ_ONCE(node->locked))
+ if (!READ_ONCE(node->locked)) {
+ pvstat_inc(pvstat_wait_node, true);
+ pvstat_inc(pvstat_wait_again, waitcnt);
pv_wait(&pn->state, vcpu_halted);
+ }
/*
- * If pv_kick_node() changed us to vcpu_hashed, retain that value
- * so that pv_wait_head() knows to not also try to hash this lock.
+ * If pv_kick_node() changed us to vcpu_hashed, retain that
+ * value so that pv_wait_head() knows to not also try to hash
+ * this lock.
*/
cmpxchg(&pn->state, vcpu_halted, vcpu_running);
@@ -200,6 +350,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
* So it is better to spin for a while in the hope that the
* MCS lock will be released soon.
*/
+ pvstat_inc(pvstat_spurious, !READ_ONCE(node->locked));
}
/*
@@ -250,6 +401,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
struct pv_node *pn = (struct pv_node *)node;
struct __qspinlock *l = (void *)lock;
struct qspinlock **lp = NULL;
+ int waitcnt = 0;
int loop;
/*
@@ -259,7 +411,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
if (READ_ONCE(pn->state) == vcpu_hashed)
lp = (struct qspinlock **)1;
- for (;;) {
+ for (;; waitcnt++) {
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (!READ_ONCE(l->locked))
return;
@@ -290,14 +442,19 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
return;
}
}
+ pvstat_inc(pvstat_wait_head, true);
+ pvstat_inc(pvstat_wait_again, waitcnt);
pv_wait(&l->locked, _Q_SLOW_VAL);
+ if (!READ_ONCE(l->locked))
+ return;
/*
* The unlocker should have freed the lock before kicking the
* CPU. So if the lock is still not free, it is a spurious
* wakeup and so the vCPU should wait again after spinning for
* a while.
*/
+ pvstat_inc(pvstat_spurious, true);
}
/*
@@ -352,6 +509,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
* vCPU is harmless other than the additional latency in completing
* the unlock.
*/
+ pvstat_inc(pvstat_kick_unlock, true);
pv_kick(node->cpu);
}
--
1.7.1
* [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt
2015-09-22 20:50 [PATCH v7 0/5] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
` (2 preceding siblings ...)
2015-09-22 20:50 ` [PATCH v7 3/5] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
@ 2015-09-22 20:50 ` Waiman Long
2015-10-13 18:23 ` Peter Zijlstra
` (2 more replies)
2015-09-22 20:50 ` [PATCH v7 5/5] locking/pvqspinlock: Queue node adaptive spinning Waiman Long
4 siblings, 3 replies; 22+ messages in thread
From: Waiman Long @ 2015-09-22 20:50 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
Waiman Long
This patch allows one attempt for the lock waiter to steal the lock
when entering the PV slowpath. This helps to reduce the performance
penalty caused by lock waiter preemption while not having many of
the downsides of a real unfair lock.
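In outline, queued_spin_trylock() is redirected to an unfair variant that
makes a single attempt on the lock byte before the waiter is queued. A
simplified sketch of the code added below:

    static inline bool pv_queued_spin_trylock_unfair(struct qspinlock *lock)
    {
            struct __qspinlock *l = (void *)lock;

            if (READ_ONCE(l->locked))
                    return false;   /* lock held, queue up as usual */
            cpu_relax();            /* give a spinning queue head a chance */
            return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;
    }

The queue head in pv_wait_head_and_lock() now also takes the lock with a
cmpxchg, so it competes on equal terms with a waiter making its one
stealing attempt.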
Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:
                        Westmere                Haswell
   Patch           32 vCPUs   48 vCPUs     32 vCPUs   48 vCPUs
   -----           --------   --------     --------   --------
   Before patch     3m15.6s   10m56.1s      1m44.1s    5m29.1s
   After patch      3m02.3s    5m00.2s      1m43.7s    3m03.5s
For the overcommitted case (48 vCPUs), this patch is able to reduce
kernel build time by more than 54% for Westmere and 44% for Haswell.
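(From the table: 10m56.1s -> 5m00.2s is about a 54% reduction on Westmere,
and 5m29.1s -> 3m03.5s is about 44% on Haswell.)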
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
kernel/locking/qspinlock.c | 19 ++---
kernel/locking/qspinlock_paravirt.h | 128 ++++++++++++++++++++++++++++-------
2 files changed, 112 insertions(+), 35 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d2f3fda..42b20b2 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -249,17 +249,15 @@ static __always_inline void set_locked(struct qspinlock *lock)
static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_kick_node(struct qspinlock *lock,
- struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_head(struct qspinlock *lock,
- struct mcs_spinlock *node) { }
-
+static __always_inline bool __pv_wait_head_and_lock(struct qspinlock *lock,
+ struct mcs_spinlock *node,
+ u32 tail)
+ { return false; }
#define pv_enabled() false
#define pv_init_node __pv_init_node
#define pv_wait_node __pv_wait_node
-#define pv_kick_node __pv_kick_node
-#define pv_wait_head __pv_wait_head
+#define pv_wait_head_and_lock __pv_wait_head_and_lock
#ifdef CONFIG_PARAVIRT_SPINLOCKS
#define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
@@ -417,7 +415,8 @@ queue:
* does not imply a full barrier.
*
*/
- pv_wait_head(lock, node);
+ if (pv_wait_head_and_lock(lock, node, tail))
+ goto release;
while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
cpu_relax();
@@ -454,7 +453,6 @@ queue:
cpu_relax();
arch_mcs_spin_unlock_contended(&next->locked);
- pv_kick_node(lock, next);
release:
/*
@@ -475,8 +473,7 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
#undef pv_init_node
#undef pv_wait_node
-#undef pv_kick_node
-#undef pv_wait_head
+#undef pv_wait_head_and_lock
#undef queued_spin_lock_slowpath
#define queued_spin_lock_slowpath __pv_queued_spin_lock_slowpath
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index ab823b6..66f3964 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -41,6 +41,31 @@ struct pv_node {
};
/*
+ * Allow one unfair trylock when entering the PV slowpath to reduce the
+ * performance impact of lock waiter preemption (either explicitly via
+ * pv_wait or implicitly via PLE). This function will be called once when
+ * a lock waiter enters the slowpath before being queued.
+ *
+ * A little bit of unfairness here can improve performance without many
+ * of the downsides of a real unfair lock.
+ */
+#define queued_spin_trylock(l) pv_queued_spin_trylock_unfair(l)
+static inline bool pv_queued_spin_trylock_unfair(struct qspinlock *lock)
+{
+ struct __qspinlock *l = (void *)lock;
+
+ if (READ_ONCE(l->locked))
+ return 0;
+ /*
+ * Wait a bit here to ensure that an actively spinning queue head vCPU
+ * has a fair chance of getting the lock.
+ */
+ cpu_relax();
+
+ return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;
+}
+
+/*
* PV qspinlock statistics
*/
enum pv_qlock_stat {
@@ -49,6 +74,7 @@ enum pv_qlock_stat {
pvstat_wait_again,
pvstat_kick_wait,
pvstat_kick_unlock,
+ pvstat_steal_lock,
pvstat_spurious,
pvstat_hops,
pvstat_num /* Total number of statistics counts */
@@ -67,6 +93,7 @@ static const char * const stat_fsnames[pvstat_num] = {
[pvstat_wait_again] = "wait_again_count",
[pvstat_kick_wait] = "kick_wait_count",
[pvstat_kick_unlock] = "kick_unlock_count",
+ [pvstat_steal_lock] = "lock_stealing_count",
[pvstat_spurious] = "spurious_wakeup",
[pvstat_hops] = "hash_hops_count",
};
@@ -146,6 +173,19 @@ static inline void pvstat_hop(int hopcnt)
}
/*
+ * PV unfair trylock count
+ */
+static inline int pvstat_trylock_unfair(struct qspinlock *lock)
+{
+ int ret = pv_queued_spin_trylock_unfair(lock);
+
+ pvstat_inc(pvstat_steal_lock, ret);
+ return ret;
+}
+#undef queued_spin_trylock
+#define queued_spin_trylock(l) pvstat_trylock_unfair(l)
+
+/*
* Replacement function for pv_kick()
*/
static inline void __pv_kick(int cpu)
@@ -338,8 +378,8 @@ static void pv_wait_node(struct mcs_spinlock *node)
/*
* If pv_kick_node() changed us to vcpu_hashed, retain that
- * value so that pv_wait_head() knows to not also try to hash
- * this lock.
+ * value so that pv_wait_head_and_lock() knows to not also
+ * try to hash this lock.
*/
cmpxchg(&pn->state, vcpu_halted, vcpu_running);
@@ -363,8 +403,9 @@ static void pv_wait_node(struct mcs_spinlock *node)
/*
* Called after setting next->locked = 1 when we're the lock owner.
*
- * Instead of waking the waiters stuck in pv_wait_node() advance their state such
- * that they're waiting in pv_wait_head(), this avoids a wake/sleep cycle.
+ * Instead of waking the waiters stuck in pv_wait_node() advance their state
+ * such that they're waiting in pv_wait_head_and_lock(), this avoids a
+ * wake/sleep cycle.
*/
static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
{
@@ -396,13 +437,15 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
* Wait for l->locked to become clear; halt the vcpu after a short spin.
* __pv_queued_spin_unlock() will wake us.
*/
-static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
+static bool pv_wait_head_and_lock(struct qspinlock *lock,
+ struct mcs_spinlock *node, u32 tail)
{
struct pv_node *pn = (struct pv_node *)node;
+ struct mcs_spinlock *next;
struct __qspinlock *l = (void *)lock;
struct qspinlock **lp = NULL;
int waitcnt = 0;
- int loop;
+ int loop, old;
/*
* If pv_kick_node() already advanced our state, we don't need to
@@ -412,10 +455,23 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
lp = (struct qspinlock **)1;
for (;; waitcnt++) {
- for (loop = SPIN_THRESHOLD; loop; loop--) {
- if (!READ_ONCE(l->locked))
- return;
- cpu_relax();
+ loop = SPIN_THRESHOLD;
+ while (loop) {
+ /*
+ * Spin until the lock is free
+ */
+ for (; loop && READ_ONCE(l->locked); loop--)
+ cpu_relax();
+ /*
+ * Seeing the lock is free, this queue head vCPU is
+ * the rightful next owner of the lock. However, the
+ * lock may have just been stolen by another task which
+ * has entered the slowpath. So we need to use atomic
+ * operation to make sure that we really get the lock.
+ * Otherwise, we have to wait again.
+ */
+ if (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)
+ goto gotlock;
}
if (!lp) { /* ONCE */
@@ -432,36 +488,60 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
*
* Matches the smp_rmb() in __pv_queued_spin_unlock().
*/
- if (!cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_SLOW_VAL)) {
+ if (xchg(&l->locked, _Q_SLOW_VAL) == 0) {
/*
- * The lock is free and _Q_SLOW_VAL has never
- * been set. Therefore we need to unhash before
- * getting the lock.
+ * The lock was free and now we own the lock.
+ * Change the lock value back to _Q_LOCKED_VAL
+ * and unhash the table.
*/
+ WRITE_ONCE(l->locked, _Q_LOCKED_VAL);
WRITE_ONCE(*lp, NULL);
- return;
+ goto gotlock;
}
}
pvstat_inc(pvstat_wait_head, true);
pvstat_inc(pvstat_wait_again, waitcnt);
pv_wait(&l->locked, _Q_SLOW_VAL);
- if (!READ_ONCE(l->locked))
- return;
/*
* The unlocker should have freed the lock before kicking the
* CPU. So if the lock is still not free, it is a spurious
- * wakeup and so the vCPU should wait again after spinning for
- * a while.
+ * wakeup or another vCPU has stolen the lock. The current
+ * vCPU should spin again.
*/
- pvstat_inc(pvstat_spurious, true);
+ pvstat_inc(pvstat_spurious, READ_ONCE(l->locked));
}
+gotlock:
/*
- * Lock is unlocked now; the caller will acquire it without waiting.
- * As with pv_wait_node() we rely on the caller to do a load-acquire
- * for us.
+ * We now have the lock. We need to either clear the tail code or
+ * notify the next one in queue as the new queue head.
*/
+ old = atomic_read(&lock->val);
+ while ((old & _Q_TAIL_MASK) == tail) {
+ int val;
+ int new = old & ~_Q_TAIL_MASK;
+
+ /*
+ * We are the only one in the queue, so clear the tail code
+ * and return.
+ */
+ val = atomic_cmpxchg(&lock->val, old, new);
+ if (old == val)
+ goto done;
+ old = val;
+ }
+
+ /*
+ * contended path; wait for next, release.
+ */
+ while (!(next = READ_ONCE(node->next)))
+ cpu_relax();
+
+ arch_mcs_spin_unlock_contended(&next->locked);
+ pv_kick_node(lock, next);
+done:
+ return 1;
}
/*
@@ -486,7 +566,7 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
* so we need a barrier to order the read of the node data in
* pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
*
- * Matches the cmpxchg() in pv_wait_head() setting _Q_SLOW_VAL.
+ * Matches the cmpxchg() in pv_wait_head_and_lock() setting _Q_SLOW_VAL.
*/
smp_rmb();
--
1.7.1
* [PATCH v7 5/5] locking/pvqspinlock: Queue node adaptive spinning
2015-09-22 20:50 [PATCH v7 0/5] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
` (3 preceding siblings ...)
2015-09-22 20:50 ` [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
@ 2015-09-22 20:50 ` Waiman Long
4 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2015-09-22 20:50 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar, Thomas Gleixner, H. Peter Anvin
Cc: x86, linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
Waiman Long
In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue would have been halted. The spinning in the later vCPUs is then
just a waste of precious CPU cycles because they are not going to
get the lock soon, as the earlier ones have to be woken up and take
their turn to get the lock.
This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() if the previous vCPU is not running.
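In rough outline, the queue node's spin loop (see pv_wait_node() below)
periodically samples the state of the previous node and stops spinning early
if that vCPU is not running; the state is only checked once every 256
iterations (PV_PREV_CHECK_MASK) so the previous node's cacheline is not
touched on every pass. A simplified sketch, with the pv_wait_early() helper
folded in:

    struct pv_node *pp = (struct pv_node *)prev;

    for (loop = SPIN_THRESHOLD; loop; loop--) {
            if (READ_ONCE(node->locked))
                    return;
            /* sample the previous node's state once every 256 iterations */
            if (!(loop & PV_PREV_CHECK_MASK) &&
                READ_ONCE(pp->state) != vcpu_running)
                    break;          /* wait early instead of spinning on */
            cpu_relax();
    }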
Linux kernel builds were run in KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:
                        Westmere                Haswell
   Patch           32 vCPUs   48 vCPUs     32 vCPUs   48 vCPUs
   -----           --------   --------     --------   --------
   Before patch     3m02.3s    5m00.2s      1m43.7s    3m03.5s
   After patch      3m03.0s    4m37.5s      1m43.0s    2m47.2s
For 32 vCPUs, this patch doesn't cause any noticeable change in
performance. For 48 vCPUs (over-committed), there is about 8%
performance improvement.
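(48 vCPUs: 5m00.2s -> 4m37.5s on Westmere is about 7.6%, and 3m03.5s ->
2m47.2s on Haswell is about 8.9%.)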
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
kernel/locking/qspinlock.c | 5 ++-
kernel/locking/qspinlock_paravirt.h | 47 +++++++++++++++++++++++++++++++++-
2 files changed, 48 insertions(+), 4 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 42b20b2..137322e 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -248,7 +248,8 @@ static __always_inline void set_locked(struct qspinlock *lock)
*/
static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node) { }
+static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
+ struct mcs_spinlock *prev) { }
static __always_inline bool __pv_wait_head_and_lock(struct qspinlock *lock,
struct mcs_spinlock *node,
u32 tail)
@@ -399,7 +400,7 @@ queue:
prev = decode_tail(old);
WRITE_ONCE(prev->next, node);
- pv_wait_node(node);
+ pv_wait_node(node, prev);
arch_mcs_spin_lock_contended(&node->locked);
}
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 66f3964..540dd6b 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -23,6 +23,19 @@
#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)
/*
+ * Queue Node Adaptive Spinning
+ *
+ * A queue node vCPU will stop spinning if the vCPU in the previous node is
+ * not running. The one lock stealing attempt allowed at slowpath entry
+ * mitigates the slight slowdown for non-overcommitted guest with this
+ * aggressive wait-early mechanism.
+ *
+ * The status of the previous node will be checked at fixed interval
+ * controlled by PV_PREV_CHECK_MASK.
+ */
+#define PV_PREV_CHECK_MASK 0xff
+
+/*
* Queue node uses: vcpu_running & vcpu_halted.
* Queue head uses: vcpu_running & vcpu_hashed.
*/
@@ -72,6 +85,7 @@ enum pv_qlock_stat {
pvstat_wait_head,
pvstat_wait_node,
pvstat_wait_again,
+ pvstat_wait_early,
pvstat_kick_wait,
pvstat_kick_unlock,
pvstat_steal_lock,
@@ -91,6 +105,7 @@ static const char * const stat_fsnames[pvstat_num] = {
[pvstat_wait_head] = "wait_head_count",
[pvstat_wait_node] = "wait_node_count",
[pvstat_wait_again] = "wait_again_count",
+ [pvstat_wait_early] = "wait_early_count",
[pvstat_kick_wait] = "kick_wait_count",
[pvstat_kick_unlock] = "kick_unlock_count",
[pvstat_steal_lock] = "lock_stealing_count",
@@ -329,6 +344,20 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
}
/*
+ * Return true when it is time to check the previous node and that node is
+ * not in a running state.
+ */
+static inline bool
+pv_wait_early(struct pv_node *prev, int loop)
+{
+
+ if ((loop & PV_PREV_CHECK_MASK) != 0)
+ return false;
+
+ return READ_ONCE(prev->state) != vcpu_running;
+}
+
+/*
* Initialize the PV part of the mcs_spinlock node.
*/
static void pv_init_node(struct mcs_spinlock *node)
@@ -346,16 +375,22 @@ static void pv_init_node(struct mcs_spinlock *node)
* pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
* behalf.
*/
-static void pv_wait_node(struct mcs_spinlock *node)
+static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
{
struct pv_node *pn = (struct pv_node *)node;
+ struct pv_node *pp = (struct pv_node *)prev;
int waitcnt = 0;
int loop;
+ bool wait_early;
for (;; waitcnt++) {
- for (loop = SPIN_THRESHOLD; loop; loop--) {
+ for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
if (READ_ONCE(node->locked))
return;
+ if (pv_wait_early(pp, loop)) {
+ wait_early = true;
+ break;
+ }
cpu_relax();
}
@@ -373,6 +408,7 @@ static void pv_wait_node(struct mcs_spinlock *node)
if (!READ_ONCE(node->locked)) {
pvstat_inc(pvstat_wait_node, true);
pvstat_inc(pvstat_wait_again, waitcnt);
+ pvstat_inc(pvstat_wait_early, wait_early);
pv_wait(&pn->state, vcpu_halted);
}
@@ -455,6 +491,12 @@ static bool pv_wait_head_and_lock(struct qspinlock *lock,
lp = (struct qspinlock **)1;
for (;; waitcnt++) {
+ /*
+ * Set correct vCPU state to be used by queue node wait-early
+ * mechanism.
+ */
+ WRITE_ONCE(pn->state, vcpu_running);
+
loop = SPIN_THRESHOLD;
while (loop) {
/*
@@ -499,6 +541,7 @@ static bool pv_wait_head_and_lock(struct qspinlock *lock,
goto gotlock;
}
}
+ WRITE_ONCE(pn->state, vcpu_halted);
pvstat_inc(pvstat_wait_head, true);
pvstat_inc(pvstat_wait_again, waitcnt);
pv_wait(&l->locked, _Q_SLOW_VAL);
--
1.7.1
* Re: [PATCH v7 1/5] locking/qspinlock: relaxes cmpxchg & xchg ops in native code
2015-09-22 20:50 ` [PATCH v7 1/5] locking/qspinlock: relaxes cmpxchg & xchg ops in native code Waiman Long
@ 2015-10-13 18:02 ` Peter Zijlstra
2015-10-13 20:38 ` Waiman Long
2015-10-14 9:39 ` Will Deacon
0 siblings, 2 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-10-13 18:02 UTC (permalink / raw)
To: Waiman Long
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso, Will Deacon,
Paul McKenney, boqun.feng
On Tue, Sep 22, 2015 at 04:50:40PM -0400, Waiman Long wrote:
> This patch replaces the cmpxchg() and xchg() calls in the native
> qspinlock code with more relaxed versions of those calls to enable
> other architectures to adopt queued spinlocks with less performance
> overhead.
> @@ -62,7 +63,7 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
> static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> {
> if (!atomic_read(&lock->val) &&
> - (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
> + (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
> return 1;
> return 0;
> }
> @@ -77,7 +78,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
> {
> u32 val;
>
> - val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
> + val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
> if (likely(val == 0))
> return;
> queued_spin_lock_slowpath(lock, val);
> @@ -319,7 +329,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> if (val == new)
> new |= _Q_PENDING_VAL;
>
> - old = atomic_cmpxchg(&lock->val, val, new);
> + old = atomic_cmpxchg_acquire(&lock->val, val, new);
> if (old == val)
> break;
>
So given recent discussion, all this _release/_acquire stuff is starting
to worry me.
So we've not declared if they should be RCsc or RCpc, and given this
patch (and the previous ones) these lock primitives turn into RCpc if
the atomic primitives are RCpc.
So far only the proposed PPC implementation is RCpc -- and their current
spinlock implementation is also RCpc, but that is a point of discussion.
Just saying..
Also, I think we should annotate the control dependencies in these
things.
* Re: [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt
2015-09-22 20:50 ` [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
@ 2015-10-13 18:23 ` Peter Zijlstra
2015-10-13 20:41 ` Waiman Long
2015-10-13 19:39 ` Peter Zijlstra
2015-10-13 19:56 ` Peter Zijlstra
2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-10-13 18:23 UTC (permalink / raw)
To: Waiman Long
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso
On Tue, Sep 22, 2015 at 04:50:43PM -0400, Waiman Long wrote:
> for (;; waitcnt++) {
> + loop = SPIN_THRESHOLD;
> + while (loop) {
> + /*
> + * Spin until the lock is free
> + */
> + for (; loop && READ_ONCE(l->locked); loop--)
> + cpu_relax();
> + /*
> + * Seeing the lock is free, this queue head vCPU is
> + * the rightful next owner of the lock. However, the
> + * lock may have just been stolen by another task which
> + * has entered the slowpath. So we need to use atomic
> + * operation to make sure that we really get the lock.
> + * Otherwise, we have to wait again.
> + */
> + if (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)
> + goto gotlock;
> }
for (loop = SPIN_THRESHOLD; loop; --loop) {
if (!READ_ONCE(l->locked) &&
cmpxchg(&l->locked, 0, _Q_LOCKED_VA) == 0)
goto gotlock;
cpu_relax();
}
* Re: [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt
2015-09-22 20:50 ` [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
2015-10-13 18:23 ` Peter Zijlstra
@ 2015-10-13 19:39 ` Peter Zijlstra
2015-10-13 20:45 ` Waiman Long
2015-10-13 19:56 ` Peter Zijlstra
2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-10-13 19:39 UTC (permalink / raw)
To: Waiman Long
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso
On Tue, Sep 22, 2015 at 04:50:43PM -0400, Waiman Long wrote:
> This patch allows one attempt for the lock waiter to steal the lock
> when entering the PV slowpath. This helps to reduce the performance
> penalty caused by lock waiter preemption while not having much of
> the downsides of a real unfair lock.
>
Changelog does not explain the implementation, which is subtle enough to
warrant a few words.
> @@ -417,7 +415,8 @@ queue:
> * does not imply a full barrier.
> *
> */
> - pv_wait_head(lock, node);
> + if (pv_wait_head_and_lock(lock, node, tail))
> + goto release;
That's very much: pv_wait_head_or_lock(), maybe _or_steal() is even
better.
> while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
> cpu_relax();
>
> @@ -454,7 +453,6 @@ queue:
> cpu_relax();
>
> arch_mcs_spin_unlock_contended(&next->locked);
> - pv_kick_node(lock, next);
Not sure about removing that, breaks symmetry.
> /*
> + * Allow one unfair trylock when entering the PV slowpath to reduce the
> + * performance impact of lock waiter preemption (either explicitly via
> + * pv_wait or implicitly via PLE). This function will be called once when
> + * a lock waiter enters the slowpath before being queued.
> + *
> + * A little bit of unfairness here can improve performance without many
> + * of the downsides of a real unfair lock.
> + */
> +#define queued_spin_trylock(l) pv_queued_spin_trylock_unfair(l)
> +static inline bool pv_queued_spin_trylock_unfair(struct qspinlock *lock)
> +{
> + struct __qspinlock *l = (void *)lock;
> +
> + if (READ_ONCE(l->locked))
> + return 0;
> + /*
> + * Wait a bit here to ensure that an actively spinning queue head vCPU
> + * has a fair chance of getting the lock.
> + */
> + cpu_relax();
> +
> + return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;
> +}
This doesn't seem to make any sense. It's also very much distinct from
the rest of the patch and can easily be added in a separate patch with
separate performance numbers to show it does (or does not) make a
difference.
* Re: [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt
2015-09-22 20:50 ` [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
2015-10-13 18:23 ` Peter Zijlstra
2015-10-13 19:39 ` Peter Zijlstra
@ 2015-10-13 19:56 ` Peter Zijlstra
2015-10-13 20:50 ` Waiman Long
2 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-10-13 19:56 UTC (permalink / raw)
To: Waiman Long
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso
On Tue, Sep 22, 2015 at 04:50:43PM -0400, Waiman Long wrote:
> +gotlock:
> /*
> + * We now have the lock. We need to either clear the tail code or
> + * notify the next one in queue as the new queue head.
> */
> + old = atomic_read(&lock->val);
> + while ((old & _Q_TAIL_MASK) == tail) {
> + int val;
> + int new = old & ~_Q_TAIL_MASK;
> +
> + /*
> + * We are the only one in the queue, so clear the tail code
> + * and return.
> + */
> + val = atomic_cmpxchg(&lock->val, old, new);
> + if (old == val)
> + goto done;
> + old = val;
> + }
> +
This I need to think about a wee bit; it's almost the same...
So the below is exactly duplicated from the normal slowpath, so why
don't you keep that there?
It would get you something like:
if (pv_wait_head_or_steal(..))
goto stolen;
stolen:
> + /*
> + * contended path; wait for next, release.
> + */
> + while (!(next = READ_ONCE(node->next)))
> + cpu_relax();
> +
> + arch_mcs_spin_unlock_contended(&next->locked);
> + pv_kick_node(lock, next);
release:
...
* Re: [PATCH v7 3/5] locking/pvqspinlock: Collect slowpath lock statistics
2015-09-22 20:50 ` [PATCH v7 3/5] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
@ 2015-10-13 20:05 ` Peter Zijlstra
2015-10-13 21:06 ` Waiman Long
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-10-13 20:05 UTC (permalink / raw)
To: Waiman Long
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso
On Tue, Sep 22, 2015 at 04:50:42PM -0400, Waiman Long wrote:
> @@ -100,10 +242,13 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
> {
> unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
> struct pv_hash_entry *he;
> + int hopcnt = 0;
>
> for_each_hash_entry(he, offset, hash) {
> + hopcnt++;
> if (!cmpxchg(&he->lock, NULL, lock)) {
> WRITE_ONCE(he->node, node);
> + pvstat_hop(hopcnt);
> return &he->lock;
> }
> }
> @@ -164,9 +309,10 @@ static void pv_init_node(struct mcs_spinlock *node)
> static void pv_wait_node(struct mcs_spinlock *node)
> {
> struct pv_node *pn = (struct pv_node *)node;
> + int waitcnt = 0;
> int loop;
>
> - for (;;) {
> + for (;; waitcnt++) {
> for (loop = SPIN_THRESHOLD; loop; loop--) {
> if (READ_ONCE(node->locked))
> return;
> @@ -250,6 +401,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> struct pv_node *pn = (struct pv_node *)node;
> struct __qspinlock *l = (void *)lock;
> struct qspinlock **lp = NULL;
> + int waitcnt = 0;
> int loop;
>
> /*
> @@ -259,7 +411,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
> if (READ_ONCE(pn->state) == vcpu_hashed)
> lp = (struct qspinlock **)1;
>
> - for (;;) {
> + for (;; waitcnt++) {
> for (loop = SPIN_THRESHOLD; loop; loop--) {
> if (!READ_ONCE(l->locked))
> return;
These things are ugly; did you verify that they compile away for the
!stats case?
* Re: [PATCH v7 1/5] locking/qspinlock: relaxes cmpxchg & xchg ops in native code
2015-10-13 18:02 ` Peter Zijlstra
@ 2015-10-13 20:38 ` Waiman Long
2015-10-13 20:47 ` Peter Zijlstra
2015-10-14 9:39 ` Will Deacon
1 sibling, 1 reply; 22+ messages in thread
From: Waiman Long @ 2015-10-13 20:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso, Will Deacon,
Paul McKenney, boqun.feng
On 10/13/2015 02:02 PM, Peter Zijlstra wrote:
> On Tue, Sep 22, 2015 at 04:50:40PM -0400, Waiman Long wrote:
>> This patch replaces the cmpxchg() and xchg() calls in the native
>> qspinlock code with more relaxed versions of those calls to enable
>> other architectures to adopt queued spinlocks with less performance
>> overhead.
>> @@ -62,7 +63,7 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
>> static __always_inline int queued_spin_trylock(struct qspinlock *lock)
>> {
>> if (!atomic_read(&lock->val) &&
>> - (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
>> + (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
>> return 1;
>> return 0;
>> }
>> @@ -77,7 +78,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
>> {
>> u32 val;
>>
>> - val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
>> + val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
>> if (likely(val == 0))
>> return;
>> queued_spin_lock_slowpath(lock, val);
>> @@ -319,7 +329,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>> if (val == new)
>> new |= _Q_PENDING_VAL;
>>
>> - old = atomic_cmpxchg(&lock->val, val, new);
>> + old = atomic_cmpxchg_acquire(&lock->val, val, new);
>> if (old == val)
>> break;
>>
> So given recent discussion, all this _release/_acquire stuff is starting
> to worry me.
>
> So we've not declared if they should be RCsc or RCpc, and given this
> patch (and the previous ones) these lock primitives turn into RCpc if
> the atomic primitives are RCpc.
>
> So far only the proposed PPC implementation is RCpc -- and their current
> spinlock implementation is also RCpc, but that is a point of discussion.
>
> Just saying..
Davidlohr's patches to make similar changes in other locking code will
also have this issue. Anyway, the goal of this patch is to make the
generic qspinlock code less costly when ported to other architectures.
This change will have no effect on the x86 architecture, which is the
only one using qspinlock at the moment.
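On x86 this is expected, since a locked cmpxchg already acts as a full
memory barrier and the _acquire/_relaxed variants therefore fall back to
the same fully ordered code.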
>
> Also, I think we should annotate the control dependencies in these
> things.
Will do.
Cheers,
Longman
* Re: [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt
2015-10-13 18:23 ` Peter Zijlstra
@ 2015-10-13 20:41 ` Waiman Long
2015-10-13 20:44 ` Peter Zijlstra
0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2015-10-13 20:41 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso
On 10/13/2015 02:23 PM, Peter Zijlstra wrote:
> On Tue, Sep 22, 2015 at 04:50:43PM -0400, Waiman Long wrote:
>> for (;; waitcnt++) {
>> + loop = SPIN_THRESHOLD;
>> + while (loop) {
>> + /*
>> + * Spin until the lock is free
>> + */
>> + for (; loop && READ_ONCE(l->locked); loop--)
>> + cpu_relax();
>> + /*
>> + * Seeing the lock is free, this queue head vCPU is
>> + * the rightful next owner of the lock. However, the
>> + * lock may have just been stolen by another task which
>> + * has entered the slowpath. So we need to use atomic
>> + * operation to make sure that we really get the lock.
>> + * Otherwise, we have to wait again.
>> + */
>> + if (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)
>> + goto gotlock;
>> }
> for (loop = SPIN_THRESHOLD; loop; --loop) {
> if (!READ_ONCE(l->locked) &&
> cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)
> goto gotlock;
>
> cpu_relax();
> }
>
This was the code that I used in my original patch, but it seemed to
confuse you into thinking it allowed too many lock stealing attempts. So I
separated it out to make my intention more explicit. I will change it back
to the old code.
Cheers,
Longman
* Re: [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt
2015-10-13 20:41 ` Waiman Long
@ 2015-10-13 20:44 ` Peter Zijlstra
2015-10-15 20:44 ` Waiman Long
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-10-13 20:44 UTC (permalink / raw)
To: Waiman Long
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso
On Tue, Oct 13, 2015 at 04:41:41PM -0400, Waiman Long wrote:
> On 10/13/2015 02:23 PM, Peter Zijlstra wrote:
> >On Tue, Sep 22, 2015 at 04:50:43PM -0400, Waiman Long wrote:
> >> for (;; waitcnt++) {
> >>+ loop = SPIN_THRESHOLD;
> >>+ while (loop) {
> >>+ /*
> >>+ * Spin until the lock is free
> >>+ */
> >>+ for (; loop && READ_ONCE(l->locked); loop--)
> >>+ cpu_relax();
> >>+ /*
> >>+ * Seeing the lock is free, this queue head vCPU is
> >>+ * the rightful next owner of the lock. However, the
> >>+ * lock may have just been stolen by another task which
> >>+ * has entered the slowpath. So we need to use atomic
> >>+ * operation to make sure that we really get the lock.
> >>+ * Otherwise, we have to wait again.
> >>+ */
> >>+ if (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)
> >>+ goto gotlock;
> >> }
> > for (loop = SPIN_THRESHOLD; loop; --loop) {
> > if (!READ_ONCE(l->locked) &&
> > cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)
> > goto gotlock;
> >
> > cpu_relax();
> > }
> >
>
> This was the code that I used in my original patch, but it seemed to confuse
> you into thinking that too much lock stealing was going on. So I separated it
> out to make my intention more explicit. I will change it back to the old code.
Code should be compact; it's the purpose of Changelogs and comments to
explain it if it's subtle.
Here you made weird code and the comments still don't explain how it's
starvation-proof, and the Changelog is almost empty of useful information.
* Re: [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt
2015-10-13 19:39 ` Peter Zijlstra
@ 2015-10-13 20:45 ` Waiman Long
0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2015-10-13 20:45 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso
On 10/13/2015 03:39 PM, Peter Zijlstra wrote:
> On Tue, Sep 22, 2015 at 04:50:43PM -0400, Waiman Long wrote:
>> This patch allows one attempt for the lock waiter to steal the lock
>> when entering the PV slowpath. This helps to reduce the performance
>> penalty caused by lock waiter preemption while not having many of
>> the downsides of a real unfair lock.
>>
> Changelog does not explain the implementation, which is subtle enough to
> warrant a few words.
Will add more information into the changelog.
>
>> @@ -417,7 +415,8 @@ queue:
>> * does not imply a full barrier.
>> *
>> */
>> - pv_wait_head(lock, node);
>> + if (pv_wait_head_and_lock(lock, node, tail))
>> + goto release;
> That's very much: pv_wait_head_or_lock(), maybe _or_steal() is even
> better.
I am not very good at naming functions. Changing it to _or_steal() is
fine with me.
>> while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_PENDING_MASK)
>> cpu_relax();
>>
>> @@ -454,7 +453,6 @@ queue:
>> cpu_relax();
>>
>> arch_mcs_spin_unlock_contended(&next->locked);
>> - pv_kick_node(lock, next);
> Not sure about removing that, breaks symmetry.
>
>> /*
>> + * Allow one unfair trylock when entering the PV slowpath to reduce the
>> + * performance impact of lock waiter preemption (either explicitly via
>> + * pv_wait or implicitly via PLE). This function will be called once when
>> + * a lock waiter enters the slowpath before being queued.
>> + *
>> + * A little bit of unfairness here can improve performance without many
>> + * of the downsides of a real unfair lock.
>> + */
>> +#define queued_spin_trylock(l) pv_queued_spin_trylock_unfair(l)
>> +static inline bool pv_queued_spin_trylock_unfair(struct qspinlock *lock)
>> +{
>> + struct __qspinlock *l = (void *)lock;
>> +
>> + if (READ_ONCE(l->locked))
>> + return 0;
>> + /*
>> + * Wait a bit here to ensure that an actively spinning queue head vCPU
>> + * has a fair chance of getting the lock.
>> + */
>> + cpu_relax();
>> +
>> + return cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0;
>> +}
> This doesn't seem to make any sense... It's also very much distinct from
> the rest of the patch and can easily be added in a separate patch with
> separate performance numbers to show whether it does (or does not) make a
> difference.
If you mean I don't need an extra cpu_relax() here, I can take that out.
It was there to give the actively spinning queue head vCPU a higher chance
of getting the lock, but it is not essential.
Cheers,
Longman
* Re: [PATCH v7 1/5] locking/qspinlock: relaxes cmpxchg & xchg ops in native code
2015-10-13 20:38 ` Waiman Long
@ 2015-10-13 20:47 ` Peter Zijlstra
0 siblings, 0 replies; 22+ messages in thread
From: Peter Zijlstra @ 2015-10-13 20:47 UTC (permalink / raw)
To: Waiman Long
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso, Will Deacon,
Paul McKenney, boqun.feng
On Tue, Oct 13, 2015 at 04:38:19PM -0400, Waiman Long wrote:
> On 10/13/2015 02:02 PM, Peter Zijlstra wrote:
> >On Tue, Sep 22, 2015 at 04:50:40PM -0400, Waiman Long wrote:
> >>This patch replaces the cmpxchg() and xchg() calls in the native
> >>qspinlock code with more relaxed versions of those calls to enable
> >>other architectures to adopt queued spinlocks with less performance
> >>overhead.
> >>@@ -62,7 +63,7 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
> >> static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> >> {
> >> if (!atomic_read(&lock->val) &&
> >>- (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
> >>+ (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
> >> return 1;
> >> return 0;
> >> }
> >>@@ -77,7 +78,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
> >> {
> >> u32 val;
> >>
> >>- val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
> >>+ val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
> >> if (likely(val == 0))
> >> return;
> >> queued_spin_lock_slowpath(lock, val);
> >>@@ -319,7 +329,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> >> if (val == new)
> >> new |= _Q_PENDING_VAL;
> >>
> >>- old = atomic_cmpxchg(&lock->val, val, new);
> >>+ old = atomic_cmpxchg_acquire(&lock->val, val, new);
> >> if (old == val)
> >> break;
> >>
> >So given recent discussion, all this _release/_acquire stuff is starting
> >to worry me.
> >
> >So we've not declared if they should be RCsc or RCpc, and given this
> >patch (and the previous ones) these lock primitives turn into RCpc if
> >the atomic primitives are RCpc.
> >
> >So far only the proposed PPC implementation is RCpc -- and their current
> >spinlock implementation is also RCpc, but that is a point of discussion.
> >
> >Just saying..
>
> Davidlohr's patches to make similar changes in other locking code will also
> have this issue.
Yes, I only fully appreciated the RCpc pain last week :/
> Anyway, the goal of this patch is to make the generic
> qspinlock code less costly when ported to other architectures.
As long as we stay away from PPC this will be fine ;-) Luckily they are
unlikely to start using it since their LPAR hypervisor thingy isn't
really co-operative.
> This change
> will have no effect on the x86 architecture which is the only one using
I know ARM and ARGH64 will want to start using this, luckily both are
RCsc so no worries there.
* Re: [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt
2015-10-13 19:56 ` Peter Zijlstra
@ 2015-10-13 20:50 ` Waiman Long
2015-10-14 9:28 ` Peter Zijlstra
0 siblings, 1 reply; 22+ messages in thread
From: Waiman Long @ 2015-10-13 20:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso
On 10/13/2015 03:56 PM, Peter Zijlstra wrote:
> On Tue, Sep 22, 2015 at 04:50:43PM -0400, Waiman Long wrote:
>
>> +gotlock:
>> /*
>> + * We now have the lock. We need to either clear the tail code or
>> + * notify the next one in queue as the new queue head.
>> */
>> + old = atomic_read(&lock->val);
>> + while ((old & _Q_TAIL_MASK) == tail) {
>> + int val;
>> + int new = old & ~_Q_TAIL_MASK;
>> +
>> + /*
>> + * We are the only one in the queue, so clear the tail code
>> + * and return.
>> + */
>> + val = atomic_cmpxchg(&lock->val, old, new);
>> + if (old == val)
>> + goto done;
>> + old = val;
>> + }
>> +
> This I need to think about a wee bit; it's almost the same...
>
>
> So the below is exactly duplicated from the normal slowpath, so why
> don't you keep that there?
>
> It would get you something like:
>
> if (pv_wait_head_or_steal(..))
> goto stolen;
>
>
> stolen:
>> + /*
>> + * contended path; wait for next, release.
>> + */
>> + while (!(next = READ_ONCE(node->next)))
>> + cpu_relax();
>> +
>> + arch_mcs_spin_unlock_contended(&next->locked);
>> + pv_kick_node(lock, next);
> release:
> ...
Yes, it is largely the same. I thought that you didn't like too much
change in the logic flow of the generic qspinlock code. I will make the
change in the next revision.
Cheers,
Longman
* Re: [PATCH v7 3/5] locking/pvqspinlock: Collect slowpath lock statistics
2015-10-13 20:05 ` Peter Zijlstra
@ 2015-10-13 21:06 ` Waiman Long
0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2015-10-13 21:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso
On 10/13/2015 04:05 PM, Peter Zijlstra wrote:
> On Tue, Sep 22, 2015 at 04:50:42PM -0400, Waiman Long wrote:
>
>> @@ -100,10 +242,13 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
>> {
>> unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
>> struct pv_hash_entry *he;
>> + int hopcnt = 0;
>>
>> for_each_hash_entry(he, offset, hash) {
>> + hopcnt++;
>> if (!cmpxchg(&he->lock, NULL, lock)) {
>> WRITE_ONCE(he->node, node);
>> + pvstat_hop(hopcnt);
>> + return &he->lock;
>> }
>> }
>> @@ -164,9 +309,10 @@ static void pv_init_node(struct mcs_spinlock *node)
>> static void pv_wait_node(struct mcs_spinlock *node)
>> {
>> struct pv_node *pn = (struct pv_node *)node;
>> + int waitcnt = 0;
>> int loop;
>>
>> - for (;;) {
>> + for (;; waitcnt++) {
>> for (loop = SPIN_THRESHOLD; loop; loop--) {
>> if (READ_ONCE(node->locked))
>> return;
>> @@ -250,6 +401,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>> struct pv_node *pn = (struct pv_node *)node;
>> struct __qspinlock *l = (void *)lock;
>> struct qspinlock **lp = NULL;
>> + int waitcnt = 0;
>> int loop;
>>
>> /*
>> @@ -259,7 +411,7 @@ static void pv_wait_head(struct qspinlock *lock, struct mcs_spinlock *node)
>> if (READ_ONCE(pn->state) == vcpu_hashed)
>> lp = (struct qspinlock **)1;
>>
>> - for (;;) {
>> + for (;; waitcnt++) {
>> for (loop = SPIN_THRESHOLD; loop; loop--) {
>> if (!READ_ONCE(l->locked))
>> return;
> These things are ugly; did you verify that they compile away for the
> !stats case?
The waitcnt was added to track whether a vCPU was suspended again without
getting the lock after being kicked. I will double-check whether it gets
compiled away in the !stat case.
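The usual way to guarantee that is to let the increment helper collapse to an empty inline when the stat code is configured out, so any local counter that only feeds it becomes dead and is dropped by the compiler. A rough user-space sketch, with a made-up config symbol and names rather than the ones in the patch:

#include <stdatomic.h>

#ifdef DEMO_LOCK_STAT
static atomic_ulong demo_stat_wait_again;

static inline void demo_stat_inc(int cond)
{
	if (cond)
		atomic_fetch_add_explicit(&demo_stat_wait_again, 1,
					  memory_order_relaxed);
}
#else
/*
 * Empty inline: the call vanishes at compile time, and a counter whose
 * only use feeds this helper can be eliminated as dead code.
 */
static inline void demo_stat_inc(int cond) { (void)cond; }
#endif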
Cheers,
Longman
* Re: [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt
2015-10-13 20:50 ` Waiman Long
@ 2015-10-14 9:28 ` Peter Zijlstra
2015-10-15 21:01 ` Waiman Long
0 siblings, 1 reply; 22+ messages in thread
From: Peter Zijlstra @ 2015-10-14 9:28 UTC (permalink / raw)
To: Waiman Long
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso
On Tue, Oct 13, 2015 at 04:50:25PM -0400, Waiman Long wrote:
> On 10/13/2015 03:56 PM, Peter Zijlstra wrote:
> >So the below is exactly duplicated from the normal slowpath, so why
> >don't you keep that there?
> >
> >It would get you something like:
> >
> > if (pv_wait_head_or_steal(..))
> > goto stolen;
> >
> >
> >stolen:
> >>+ /*
> >>+ * contended path; wait for next, release.
> >>+ */
> >>+ while (!(next = READ_ONCE(node->next)))
> >>+ cpu_relax();
> >>+
> >>+ arch_mcs_spin_unlock_contended(&next->locked);
> >>+ pv_kick_node(lock, next);
> >release:
> > ...
>
> Yes, it is largely the same. I thought that you didn't like too much change
> in the logic flow of the generic qspinlock code. I will make the change in
> the next revision.
Well, you already put the branch in there; the only difference here is
an 'extra' label. OTOH that extra label avoids duplicating some hairy
code. So overall I would say it's a definite win.
And it's easy to see it will compile away in the native case where:
#define pv_wait_head_or_steal(l, n, t) (false)
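A stand-alone illustration of that compile-away, with all names made up for the example: once the helper is stubbed to a constant false, the branch and the code behind the label are dead and the compiler drops them.

#include <stdbool.h>
#include <stdio.h>

struct demo_lock { int dummy; };

#ifndef DEMO_PARAVIRT
/* Native build: stub the helper out so the branch below folds away. */
#define demo_wait_head_or_steal(lock)	(false)
#else
bool demo_wait_head_or_steal(struct demo_lock *lock);	/* real PV version elsewhere */
#endif

static void demo_slowpath(struct demo_lock *lock)
{
	if (demo_wait_head_or_steal(lock))
		goto stolen;		/* dead code when the stub is constant false */

	puts("normal queued handoff path");
	return;

stolen:
	puts("early exit: this waiter already holds the lock");
}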
* Re: [PATCH v7 1/5] locking/qspinlock: relaxes cmpxchg & xchg ops in native code
2015-10-13 18:02 ` Peter Zijlstra
2015-10-13 20:38 ` Waiman Long
@ 2015-10-14 9:39 ` Will Deacon
1 sibling, 0 replies; 22+ messages in thread
From: Will Deacon @ 2015-10-14 9:39 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Waiman Long, Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86,
linux-kernel, Scott J Norton, Douglas Hatch, Davidlohr Bueso,
Paul McKenney, boqun.feng
On Tue, Oct 13, 2015 at 08:02:25PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 22, 2015 at 04:50:40PM -0400, Waiman Long wrote:
> > This patch replaces the cmpxchg() and xchg() calls in the native
> > qspinlock code with more relaxed versions of those calls to enable
> > other architectures to adopt queued spinlocks with less performance
> > overhead.
>
> > @@ -62,7 +63,7 @@ static __always_inline int queued_spin_is_contended(struct qspinlock *lock)
> > static __always_inline int queued_spin_trylock(struct qspinlock *lock)
> > {
> > if (!atomic_read(&lock->val) &&
> > - (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) == 0))
> > + (atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL) == 0))
> > return 1;
> > return 0;
> > }
> > @@ -77,7 +78,7 @@ static __always_inline void queued_spin_lock(struct qspinlock *lock)
> > {
> > u32 val;
> >
> > - val = atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL);
> > + val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
> > if (likely(val == 0))
> > return;
> > queued_spin_lock_slowpath(lock, val);
>
> > @@ -319,7 +329,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> > if (val == new)
> > new |= _Q_PENDING_VAL;
> >
> > - old = atomic_cmpxchg(&lock->val, val, new);
> > + old = atomic_cmpxchg_acquire(&lock->val, val, new);
> > if (old == val)
> > break;
> >
>
> So given recent discussion, all this _release/_acquire stuff is starting
> to worry me.
>
> So we've not declared if they should be RCsc or RCpc, and given this
> patch (and the previous ones) these lock primitives turn into RCpc if
> the atomic primitives are RCpc.
Our spinlocks are currently RCpc and I think these really should match.
> So far only the proposed PPC implementation is RCpc -- and their current
> spinlock implementation is also RCpc, but that is a point of discussion.
>
> Just saying..
Well, PPC already made that choice for their spinlocks, so I don't
necessarily think we should worry too much here. They get to choose
between a ~5% performance hit (iirc) or a higher potential for subtle
locking issues. That said, I don't believe the RCpc/RCsc choice actually
affects common locking paradigms (so far, we're only aware of RCU needing
to do something different on PPC).
If we end up getting swamped with horrible bug reports from PPC users,
then we can add a TAINT_RCPC ;) (or just strengthen their implementation).
> Also, I think we should annotate the control dependencies in these
> things.
Yes, please.
Will
* Re: [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt
2015-10-13 20:44 ` Peter Zijlstra
@ 2015-10-15 20:44 ` Waiman Long
0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2015-10-15 20:44 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso
On 10/13/2015 04:44 PM, Peter Zijlstra wrote:
> On Tue, Oct 13, 2015 at 04:41:41PM -0400, Waiman Long wrote:
>> On 10/13/2015 02:23 PM, Peter Zijlstra wrote:
>>> On Tue, Sep 22, 2015 at 04:50:43PM -0400, Waiman Long wrote:
>>>> for (;; waitcnt++) {
>>>> + loop = SPIN_THRESHOLD;
>>>> + while (loop) {
>>>> + /*
>>>> + * Spin until the lock is free
>>>> + */
>>>> + for (; loop && READ_ONCE(l->locked); loop--)
>>>> + cpu_relax();
>>>> + /*
>>>> + * Seeing the lock is free, this queue head vCPU is
>>>> + * the rightful next owner of the lock. However, the
>>>> + * lock may have just been stolen by another task which
>>>> + * has entered the slowpath. So we need to use atomic
>>>> + * operation to make sure that we really get the lock.
>>>> + * Otherwise, we have to wait again.
>>>> + */
>>>> + if (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)
>>>> + goto gotlock;
>>>> }
>>> for (loop = SPIN_THRESHOLD; loop; --loop) {
>>> if (!READ_ONCE(l->locked) &&
>>> cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0)
>>> goto gotlock;
>>>
>>> cpu_relax();
>>> }
>>>
>> This was the code that I used in my original patch, but it seemed to confuse
>> you into thinking that too much lock stealing was going on. So I separated it
>> out to make my intention more explicit. I will change it back to the old code.
> Code should be compact; it's the purpose of Changelogs and comments to
> explain it if it's subtle.
>
> Here you made weird code and the comments still don't explain how it's
> starvation-proof, and the Changelog is almost empty of useful information.
You are right. The current patch can't guarantee that there will be no
lock starvation. I will make some code changes to make sure that lock
starvation won't happen. I will also try to clean up the code and the
comments at the same time.
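As a purely illustrative sketch of one possible anti-starvation scheme (not necessarily what the next revision will do, and none of these names come from the patch), stealing could be permitted only while the queue head has not yet closed the stealing window:

#include <stdatomic.h>
#include <stdbool.h>

/*
 * Illustrative only: would-be stealers back off once the queue head,
 * after too many failed wait/kick rounds, flips a flag that closes the
 * stealing window, so the head is sure to win the next release.
 */
struct demo_pv_lock {
	atomic_uchar locked;	/* 0 = free, 1 = held */
	atomic_uchar no_steal;	/* set by a queue head that has waited too long */
};

static bool demo_try_steal(struct demo_pv_lock *l)
{
	unsigned char expected = 0;

	if (atomic_load_explicit(&l->no_steal, memory_order_acquire))
		return false;	/* head asked for a fair handoff */
	return atomic_compare_exchange_strong_explicit(&l->locked, &expected, 1,
			memory_order_acquire, memory_order_relaxed);
}

static void demo_head_close_steal_window(struct demo_pv_lock *l)
{
	/* Called by the queue head after N unsuccessful wait rounds. */
	atomic_store_explicit(&l->no_steal, 1, memory_order_release);
}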
Cheers,
Longman
* Re: [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt
2015-10-14 9:28 ` Peter Zijlstra
@ 2015-10-15 21:01 ` Waiman Long
0 siblings, 0 replies; 22+ messages in thread
From: Waiman Long @ 2015-10-15 21:01 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Ingo Molnar, Thomas Gleixner, H. Peter Anvin, x86, linux-kernel,
Scott J Norton, Douglas Hatch, Davidlohr Bueso
On 10/14/2015 05:28 AM, Peter Zijlstra wrote:
> On Tue, Oct 13, 2015 at 04:50:25PM -0400, Waiman Long wrote:
>> On 10/13/2015 03:56 PM, Peter Zijlstra wrote:
>>> So the below is exactly duplicated from the normal slowpath, so why
>>> don't you keep that there?
>>>
>>> It would get you something like:
>>>
>>> if (pv_wait_head_or_steal(..))
>>> goto stolen;
>>>
>>>
>>> stolen:
>>>> + /*
>>>> + * contended path; wait for next, release.
>>>> + */
>>>> + while (!(next = READ_ONCE(node->next)))
>>>> + cpu_relax();
>>>> +
>>>> + arch_mcs_spin_unlock_contended(&next->locked);
>>>> + pv_kick_node(lock, next);
>>> release:
>>> ...
>> Yes, it is largely the same. I thought that you didn't like too much change
>> in the logic flow of the generic qspinlock code. I will make the change in
>> the next revision.
> Well, you already put the branch in there; the only difference here is
> an 'extra' label. OTOH that extra label avoids duplicating some hairy
> code. So overall I would say it's a definite win.
>
> And it's easy to see it will compile away in the native case where:
>
> #define pv_wait_head_or_steal(l, n, t) (false)
>
>
I have done it in a different way to avoid duplicating code. I think I
am not going to use the pv_wait_head_or_steal name after all, as I don't
consider the queue head CPU acquiring the lock to be lock stealing. I
will skip the "and" and name it pv_wait_head_lock instead. Please let me
know if you have a better idea.
Cheers,
Longman
end of thread
Thread overview: 22+ messages
2015-09-22 20:50 [PATCH v7 0/5] locking/qspinlock: Enhance pvqspinlock performance Waiman Long
2015-09-22 20:50 ` [PATCH v7 1/5] locking/qspinlock: relaxes cmpxchg & xchg ops in native code Waiman Long
2015-10-13 18:02 ` Peter Zijlstra
2015-10-13 20:38 ` Waiman Long
2015-10-13 20:47 ` Peter Zijlstra
2015-10-14 9:39 ` Will Deacon
2015-09-22 20:50 ` [PATCH v7 2/5] locking/pvqspinlock, x86: Optimize PV unlock code path Waiman Long
2015-09-22 20:50 ` [PATCH v7 3/5] locking/pvqspinlock: Collect slowpath lock statistics Waiman Long
2015-10-13 20:05 ` Peter Zijlstra
2015-10-13 21:06 ` Waiman Long
2015-09-22 20:50 ` [PATCH v7 4/5] locking/pvqspinlock: Allow 1 lock stealing attempt Waiman Long
2015-10-13 18:23 ` Peter Zijlstra
2015-10-13 20:41 ` Waiman Long
2015-10-13 20:44 ` Peter Zijlstra
2015-10-15 20:44 ` Waiman Long
2015-10-13 19:39 ` Peter Zijlstra
2015-10-13 20:45 ` Waiman Long
2015-10-13 19:56 ` Peter Zijlstra
2015-10-13 20:50 ` Waiman Long
2015-10-14 9:28 ` Peter Zijlstra
2015-10-15 21:01 ` Waiman Long
2015-09-22 20:50 ` [PATCH v7 5/5] locking/pvqspinlock: Queue node adaptive spinning Waiman Long