* [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
       [not found] <1375324631-32868-1-git-send-email-Waiman.Long@hp.com>
@ 2013-08-01  2:37 ` Waiman Long
  2013-08-01  2:37   ` Waiman Long
                     ` (2 more replies)
  2013-08-01  2:37 ` [PATCH RFC 2/2] qspinlock x86: Enable x86 to use queue spinlock Waiman Long
  1 sibling, 3 replies; 26+ messages in thread
From: Waiman Long @ 2013-08-01  2:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
Cc: Waiman Long, linux-arch, x86, linux-kernel, Peter Zijlstra,
	Steven Rostedt, Andrew Morton, Richard Weinberger, Catalin Marinas,
	Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita,
	Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel,
	Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
	Harvey Harrison, Chandramouleeswaran, Aswin, Norton, Scott J

This patch introduces a new queue spinlock implementation that can serve
as an alternative to the default ticket spinlock. The queue spinlock is
as fair as the ticket spinlock, and it is faster both in the
single-thread case and under high contention. Only under light to
moderate contention, where the average queue depth is around 1-2, may
this queue spinlock be slightly slower due to the higher slowpath
overhead. This queue spinlock is especially suited to NUMA machines with
a large number of cores, as the chance of spinlock contention is much
higher on those machines.

The idea behind this spinlock implementation is the fact that spinlocks
are acquired with preemption disabled. In other words, a process will
not be migrated to another CPU while it is trying to get a spinlock.
Ignoring interrupt handling, a CPU can be contending on only one
spinlock at any one time. Of course, an interrupt handler can try to
acquire one spinlock while the interrupted process is in the middle of
getting another one.
By allocating a set of per-cpu queue nodes and using them to form a
waiting queue, we can encode the queue node address into a much smaller
16-bit value. Together with the 1-byte lock bit, this queue spinlock
implementation needs only 4 bytes to hold all the information it needs.
The current queue node address encoding is as follows:

Bits 0-1 : queue node index in the per-cpu array (4 entries)
Bits 2-15: cpu number + 1 (max cpus = 16383)

By also using the currently unused byte, we could support more CPUs and
more queue node entries in the array, at the expense of more code and
probably some performance overhead. In the extremely unlikely case that
all the queue node entries are used up, the current code falls back to
busy spinning without waiting in a queue, after printing a warning
message.

With a minor modification to the lock fastpath, we could also make a
less fair version of the queue spinlock that allows lock stealing. This
can potentially provide better performance for some workloads.

This patch allows the optional replacement of the architecture-specific
ticket spinlock implementation by this generic queue spinlock. Two new
config parameters are introduced:

1. QUEUE_SPINLOCK
   A selectable option that enables the building and replacement of the
   architecture-specific spinlock by the queue spinlock.
2. ARCH_QUEUE_SPINLOCK
   Has to be defined in arch/$(ARCH)/Kconfig to enable the
   QUEUE_SPINLOCK option. This option, by itself, will not enable the
   queue spinlock feature.

For single-thread performance (no contention), a 256K lock/unlock loop
was run on a 2.93GHz Westmere x86-64 CPU. The following table shows the
average time (in ns) for a single lock/unlock sequence (including the
looping and timing overhead):

Lock Type		Time (ns)
---------		---------
Ticket spinlock		  12.1
Queue spinlock		  10.9

So the queue spinlock is slightly faster. The AIM7 benchmark was run on
an 8-socket 80-core DL980 with Westmere x86-64 CPUs to evaluate the
performance impact of this patch on the 3.10.1 kernel.
The jobs per minute (JPM) results for the 1100-2000 user range are shown
below:

+------------+--------+--------+--------+--------+--------+--------+
|   Kernel   | 3.10.1 | 3.10.1 |%Change | 3.10.1 | 3.10.1 |%Change |
|            |        |  qlock |        |        |  qlock |        |
|     HT     |   off  |   off  |        |   on   |   on   |        |
+------------+--------+--------+--------+--------+--------+--------+
|alltests    | 403590 | 397622 |  -1.5% | 411479 | 411428 |   0.0% |
|custom      | 304581 | 307165 |  +0.9% | 230335 | 254798 | +10.6% |
|dbase       | 896252 | 900597 |  +0.5% |1134659 |1137730 |  +0.3% |
|disk        | 213679 | 270597 | +26.6% | 203974 | 249998 | +22.6% |
|five_sec    | 144012 | 157286 |  +9.2% | 132599 | 150812 | +13.7% |
|fserver     | 472741 | 467213 |  -1.2% | 270601 | 285737 |  +5.6% |
|high_systime| 131106 | 139015 |  +6.0% | 120294 | 125802 |  +4.6% |
|new_dbase   | 934871 | 936860 |  +0.2% |1177640 |1181679 |  +0.3% |
|new_fserver | 452544 | 444967 |  -1.7% | 262368 | 275827 |  +5.1% |
|shared      | 366750 | 369901 |  +0.9% | 403645 | 415840 |  +3.0% |
|short       |1061663 |3003942 |+183.0% |1040392 |2834437 |+172.4% |
+------------+--------+--------+--------+--------+--------+--------+

There are cases where the performance drops 1-2%, but the majority of
the results are performance gains, some of them quite significant. For
the disk workload, the spinlock bottleneck is the standalone
mb_cache_spinlock in fs/mbcache.c. For the short workload, the spinlock
bottleneck is the d_lock in the dentry structure.

The following table shows the %time consumed by the locks, as reported
by the perf command, at 1000 users for the disk workload and 1500 users
for the short workload, with HT off.
-----------------+----------+----------+----------+
 Lock            |  ticket  |  qlock   |  qlock   |
                 | spinlock | fastpath | slowpath |
-----------------+----------+----------+----------+
Disk w/o patch   |  69.0%   |    -     |    -     |
Disk with patch  |    -     |  0.31%   |  73.6%   |
Short w/o patch  |  74.3%   |    -     |    -     |
Short with patch |    -     |  0.72%   |  72.3%   |
-----------------+----------+----------+----------+

There are 2 observations here:
1. The %CPU time actually increases in the disk workload with the patch.
   The fact that most of the spinners are spinning on their own
   cachelines with less congestion probably skews the number up.
2. Both the short and disk workloads spend about 70% of their time in
   the lock, yet the performance improvement is 2.8X vs 1.3X. This is
   probably because d_lock is embedded next to the d_count field being
   updated, whereas mb_cache_spinlock is in a standalone cacheline not
   shared with the data to be updated.

Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 include/asm-generic/qspinlock.h |  120 ++++++++++++++++++++
 lib/Kconfig                     |   14 +++
 lib/Makefile                    |    1 +
 lib/qspinlock.c                 |  237 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 372 insertions(+), 0 deletions(-)
 create mode 100644 include/asm-generic/qspinlock.h
 create mode 100644 lib/qspinlock.c

diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
new file mode 100644
index 0000000..4cfb369
--- /dev/null
+++ b/include/asm-generic/qspinlock.h
@@ -0,0 +1,120 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#ifndef __ASM_GENERIC_QSPINLOCK_H
+#define __ASM_GENERIC_QSPINLOCK_H
+
+#include <linux/types.h>
+#include <asm/cmpxchg.h>
+#include <asm/barrier.h>
+#include <asm/processor.h>
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define qspinlock arch_spinlock
+#endif
+
+/*
+ * The queue spinlock data structure
+ */
+typedef struct qspinlock {
+	union {
+		struct {
+			u8  locked;	/* Bit lock */
+			u8  reserved;
+			u16 qcode;	/* Wait queue code */
+		};
+		u32 qlock;
+	};
+} arch_spinlock_t;
+
+/*
+ * External function declarations
+ */
+extern void queue_spin_lock_slowpath(struct qspinlock *lock);
+
+/**
+ * queue_spin_is_locked - is the spinlock locked?
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if it is locked, 0 otherwise
+ */
+static __always_inline int queue_spin_is_locked(struct qspinlock *lock)
+{
+	return ACCESS_ONCE(lock->locked);
+}
+
+/**
+ * queue_spin_is_contended - check if the lock is contended
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock contended, 0 otherwise
+ */
+static __always_inline int queue_spin_is_contended(struct qspinlock *lock)
+{
+	return ACCESS_ONCE(lock->qcode);
+}
+
+/**
+ * queue_spin_trylock - try to acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_spin_trylock(struct qspinlock *lock)
+{
+	if (!queue_spin_is_contended(lock) && (xchg(&lock->locked, 1) == 0))
+		return 1;
+	return 0;
+}
+
+/**
+ * queue_spin_lock - acquire a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_lock(struct qspinlock *lock)
+{
+	if (likely(queue_spin_trylock(lock)))
+		return;
+	queue_spin_lock_slowpath(lock);
+}
+
+/**
+ * queue_spin_unlock - release a queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+static __always_inline void queue_spin_unlock(struct qspinlock *lock)
+{
+	barrier();
+	ACCESS_ONCE(lock->locked) = 0;
+	smp_wmb();
+}
+
+/*
+ * Initializer
+ */
+#define __ARCH_SPIN_LOCK_UNLOCKED	{ { .qlock = 0 } }
+
+#ifndef CONFIG_PARAVIRT_SPINLOCKS
+/*
+ * Remapping spinlock architecture specific functions to the corresponding
+ * queue spinlock functions.
+ */
+#define arch_spin_is_locked(l)		queue_spin_is_locked(l)
+#define arch_spin_is_contended(l)	queue_spin_is_contended(l)
+#define arch_spin_lock(l)		queue_spin_lock(l)
+#define arch_spin_trylock(l)		queue_spin_trylock(l)
+#define arch_spin_unlock(l)		queue_spin_unlock(l)
+#define arch_spin_lock_flags(l, f)	queue_spin_lock(l)
+#endif
+
+#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/lib/Kconfig b/lib/Kconfig
index 35da513..15e0e02 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -412,6 +412,20 @@ config SIGNATURE
	  Implementation is done using GnuPG MPI library

 #
+# Generic queue spinlock
+#
+config QUEUE_SPINLOCK
+	bool "Generic queue spinlock"
+	depends on ARCH_QUEUE_SPINLOCK
+	default n
+	help
+	  Use an alternative spinlock implementation that has the same
+	  spinlock structure size but is more optimized for larger
+	  NUMA systems with a lot of CPU cores.  Specifically, waiting
+	  lock spinners are put to wait in a queue on a local cacheline
+	  rather than all spinning on the same lock cacheline.
+
+#
 # libfdt files, only selected if needed.
 #
 config LIBFDT
diff --git a/lib/Makefile b/lib/Makefile
index 7baccfd..b53710a 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -187,3 +187,4 @@ quiet_cmd_build_OID_registry = GEN     $@
 clean-files	+= oid_registry_data.c

 obj-$(CONFIG_UCS2_STRING)	+= ucs2_string.o
+obj-$(CONFIG_QUEUE_SPINLOCK)	+= qspinlock.o
diff --git a/lib/qspinlock.c b/lib/qspinlock.c
new file mode 100644
index 0000000..21d5f6e
--- /dev/null
+++ b/lib/qspinlock.c
@@ -0,0 +1,237 @@
+/*
+ * Queue spinlock
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2013 Hewlett-Packard Development Company, L.P.
+ *
+ * Authors: Waiman Long <waiman.long@hp.com>
+ */
+#include <linux/smp.h>
+#include <linux/bug.h>
+#include <linux/cpumask.h>
+#include <linux/percpu.h>
+#include <linux/hardirq.h>
+#include <asm-generic/qspinlock.h>
+
+/*
+ * The queue spinlock fastpath is as simple as it can get, all the heavy
+ * lifting is done in the lock slowpath.  The main idea behind this queue
+ * spinlock implementation is to keep the spinlock size at 4 bytes while
+ * at the same time implementing a queue structure to queue up the waiting
+ * lock spinners.
+ *
+ * Since preemption is disabled before getting the lock, a given CPU will
+ * only need to use one queue node structure in a non-interrupt context.
+ * A percpu queue node structure will be allocated for this purpose and the
+ * cpu number will be put into the queue spinlock structure to indicate the
+ * tail of the queue.
+ *
+ * To handle spinlock acquisition in interrupt context (softirq or hardirq),
+ * the queue node structure is actually an array for supporting nested spin
+ * locking operations in interrupt handlers.  If all the entries in the
+ * array are used up, a warning message will be printed (as that shouldn't
+ * happen in normal circumstances) and the lock spinner will fall back to
+ * busy spinning instead of waiting in a queue.
+ */
+
+/*
+#ifndef CONFIG_DEBUG_SPINLOCK
+#define CONFIG_DEBUG_SPINLOCK	1
+#endif
+ */
+
+/*
+ * The queue node structure
+ */
+struct qnode {
+	struct qnode *next;
+	u8	 wait;		/* Waiting flag */
+	u8	 used;		/* Used flag */
+#ifdef CONFIG_DEBUG_SPINLOCK
+	u16	 cpu_nr;	/* CPU number */
+	void	*lock;		/* Lock address */
+#endif
+};
+
+/*
+ * The 16-bit wait queue code is divided into the following 2 fields:
+ * Bits 0-1 : queue node index
+ * Bits 2-15: cpu number + 1
+ *
+ * The current implementation will allow a maximum of (1<<14)-1 = 16383 CPUs.
+ */
+#define GET_QN_IDX(code)	((code) & 0x3)
+#define GET_CPU_NR(code)	(((code) >> 2) - 1)
+#define SET_CPU_CODE(cpu, idx)	((((cpu) + 1) << 2) | (idx))
+#define MAX_QNODES		4
+#define MAX_CPUS		((1 << 14) - 1)
+
+/*
+ * Per-CPU queue node structures
+ */
+static DEFINE_PER_CPU(struct qnode [MAX_QNODES], qnodes) = { { 0 } };
+
+/**
+ * queue_trylock - try to acquire the lock bit, ignoring the qcode in lock
+ * @lock : Pointer to queue spinlock structure
+ * Return: 1 if lock acquired, 0 if failed
+ */
+static __always_inline int queue_trylock(struct qspinlock *lock)
+{
+	if (!ACCESS_ONCE(lock->locked) && (xchg(&lock->locked, 1) == 0))
+		return 1;
+	return 0;
+}
+
+/**
+ * queue_spin_lock_slowpath - acquire the queue spinlock
+ * @lock : Pointer to queue spinlock structure
+ */
+void queue_spin_lock_slowpath(struct qspinlock *lock)
+{
+	unsigned int cpu_nr, qn_idx;
+	struct qnode *node, *prev, *next;
+	u16 oldcode, newcode;

+	/*
+	 * The current code does not work with 16383 (MAX_CPUS) or more cpus.
+	 */
+	BUILD_BUG_ON(CONFIG_NR_CPUS >= MAX_CPUS);
+
+	/*
+	 * Get the queue node
+	 */
+	cpu_nr = smp_processor_id();
+	node   = per_cpu_ptr(&qnodes[0], cpu_nr);
+	qn_idx = 0;
+
+	if (unlikely(node->used)) {
+		/*
+		 * The node has been used, try to find an empty queue
+		 * node entry.
+		 */
+		for (qn_idx = 1; qn_idx < MAX_QNODES; qn_idx++)
+			if (!node[qn_idx].used)
+				break;
+		if (unlikely(qn_idx == MAX_QNODES)) {
+			/*
+			 * This shouldn't happen; print a warning message
+			 * and busy spin on the lock.
+			 */
+			pr_warn("qspinlock: queue node table exhausted at cpu %d!\n",
+				cpu_nr);
+			while (!queue_trylock(lock))
+				cpu_relax();
+			return;
+		}
+		/* Adjust node pointer */
+		node += qn_idx;
+	}
+
+	/*
+	 * Set up the new cpu code to be exchanged
+	 */
+	newcode = SET_CPU_CODE(cpu_nr, qn_idx);
+
+	/*
+	 * The lock may be available at this point; try again before waiting
+	 * in a queue.
+	 */
+	if (queue_spin_trylock(lock))
+		return;
+
+	/*
+	 * Set up the queue node
+	 */
+#ifdef CONFIG_DEBUG_SPINLOCK
+	BUG_ON(node->used);
+#endif
+	node->used = true;
+	barrier();
+#ifdef CONFIG_DEBUG_SPINLOCK
+	BUG_ON(node->lock || node->next || node->wait);
+	node->cpu_nr = cpu_nr;
+	node->lock   = (void *)lock;
+#endif
+	node->wait = true;
+	node->next = NULL;
+
+	/*
+	 * Exchange the current copy of the cpu code
+	 */
+	oldcode = xchg(&lock->qcode, newcode);
+	if (oldcode) {
+		/*
+		 * Not at the queue head; get the address of the previous node
+		 * and set up the "next" pointer of that node.
+		 */
+		prev = per_cpu_ptr(&qnodes[GET_QN_IDX(oldcode)],
+				   GET_CPU_NR(oldcode));
+#ifdef CONFIG_DEBUG_SPINLOCK
+		BUG_ON(prev->cpu_nr == cpu_nr);
+		BUG_ON(prev->lock != (void *)lock);
+		BUG_ON(!prev->used || (prev->next != NULL));
+#endif
+		ACCESS_ONCE(prev->next) = node;
+		smp_wmb();
+		/*
+		 * Wait until the waiting flag is off
+		 */
+		while (ACCESS_ONCE(node->wait))
+			cpu_relax();
+	}
+
+	/*
+	 * At the head of the wait queue now, try to get the lock in a
+	 * spin loop.
+	 */
+	while (!queue_trylock(lock))
+		cpu_relax();
+
+	next = ACCESS_ONCE(node->next);
+	if (next)
+		goto notify_next;
+	/*
+	 * Clear the wait queue if this is the last node
+	 */
+	if ((ACCESS_ONCE(lock->qcode) == newcode) &&
+	    (cmpxchg(&lock->qcode, newcode, 0) == newcode))
+		goto release_qnode;
+	/*
+	 * Wait until the next one in the queue has set up the next field
+	 */
+	while (!(next = ACCESS_ONCE(node->next)))
+		cpu_relax();
+notify_next:
+#ifdef CONFIG_DEBUG_SPINLOCK
+	BUG_ON(next->cpu_nr == cpu_nr);
+	BUG_ON(next->lock != (void *)lock);
+	BUG_ON(!next->wait || !next->used);
+#endif
+	/*
+	 * The next one in the queue is now at the head
+	 */
+	ACCESS_ONCE(next->wait) = false;
+
+release_qnode:
+	/*
+	 * Release the queue node
+	 */
+	node->next = NULL;
+	node->used = false;
+	node->wait = false;
+#ifdef CONFIG_DEBUG_SPINLOCK
+	node->lock = NULL;
+	BUG_ON(cpu_nr != smp_processor_id());
+#endif
+	smp_wmb();
+}
+EXPORT_SYMBOL(queue_spin_lock_slowpath);
--
1.7.1
* [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation 2013-08-01 2:37 ` [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long @ 2013-08-01 2:37 ` Waiman Long [not found] ` <20130801094029.GK3008@twins.programming.kicks-ass.net> 2013-08-01 20:23 ` Raghavendra K T 2 siblings, 0 replies; 26+ messages in thread From: Waiman Long @ 2013-08-01 2:37 UTC (permalink / raw) To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann Cc: Waiman Long, linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt, Andrew Morton, Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin, Harvey Harrison, Chandramouleeswaran, Aswin, Norton, Scott J This patch introduces a new queue spinlock implementation that can serve as an alternative to the default ticket spinlock. Compared with the ticket spinlock, this queue spinlock should be as fair as the ticket spinlock. It is faster both in single-thread and in high contention situations. Only in light to moderate contention where the average queue depth is around 1-2 will this queue spinlock be potentially a bit slower due to the higher slowpath overhead. This queue spinlock is especially suit to NUMA machines with a large number of cores as the chance of spinlock contention is much higher in those machines. The idea behind this spinlock implementation is the fact that spinlocks are acquired with preemption disabled. In other words, the process will not be migrated to another CPU while it is trying to get a spinlock. Ignoring interrupt handling, a CPU can only be contending in one spinlock at any one time. Of course, interrupt handler can try to acquire one spinlock while the interrupted user process is in the process of getting another spinlock. 
By allocating a set of per-cpu queue nodes and used them to form a waiting queue, we can encode the queue node address into a much smaller 16-bit size. Together with the 1-byte lock bit, this queue spinlock implementation will only need 4 bytes to hold all the information that it needs. The current queue node address encoding is as follows: Bits 0-1 : queue node index in the per-cpu array (4 entries) Bits 2-16: cpu number + 1 (max cpus = 16383) By using also the unused byte, we could support more cpus and queue node entries in the array at the expense of more coding and probably performance overheads. In the extremely unlikely case that all the queue node entries are used up, the current code will fall back to busy spinning without waiting in a queue with warning message. With minor modification to the lock fastpath, we could also make a less fair version of queue spinlock that allows lock stealing. This can potentially provide better performance for some workloads. This patch allows the optional replacement of architecture specific implementation of ticket spinlock by this generic version of queue spinlock. Two new config parameters are introduced: 1. QUEUE_SPINLOCK A select-able option that enables the building and replacement of architecture specific spinlock by queue spinlock. 2. ARCH_QUEUE_SPINLOCK Have to be defined in arch/$(ARCH)/Kconfig to enable QUEUE_SPINLOCK option. This option, by itself, will not enable the queue spinlock feature. For single-thread performance (no contention), a 256K lock/unlock loop was run on a 2.93Ghz Westmere x86-64 CPU. The following table shows the average time (in ns) for a single lock/unlock sequence (including the looping and timing overhead): Lock Type Time (ns) --------- --------- Ticket spinlock 12.1 Queue spinlock 10.9 So the queue spinlock is slightly faster. The AIM7 benchmark was run on a 8-socket 80-core DL980 with Westmere x86-64 CPUs to evaluate the performance impact of this patch on the 3.10.1 kernel. 
The jobs per minute (JPM) results for the 1100-2000 user ranges are shown below: +------------+--------+--------+--------+--------+--------+--------+ | Kernel | 3.10.1 | 3.10.1 |%Change | 3.10.1 | 3.10.1 |%Change | | | | qlock | | | qlock | | | HT | off | off | | on | on | | +------------+--------+--------+--------+--------+--------+--------+ |alltests | 403590 | 397622 | -1.5% | 411479 | 411428 | 0.0% | |custom | 304581 | 307165 | +0.9% | 230335 | 254798 | +10.6% | |dbase | 896252 | 900597 | +0.5% |1134659 |1137730 | +0.3% | |disk | 213679 | 270597 | +26.6% | 203974 | 249998 | +22.6% | |five_sec | 144012 | 157286 | +9.2% | 132599 | 150812 | +13.7% | |fserver | 472741 | 467213 | -1.2% | 270601 | 285737 | +5.6% | |high_systime| 131106 | 139015 | +6.0% | 120294 | 125802 | +4.6% | |new_dbase | 934871 | 936860 | +0.2 |1177640 |1181679 | +0.3% | |new_fserver | 452544 | 444967 | -1.7% | 262368 | 275827 | +5.1% | |shared | 366750 | 369901 | +0.9% | 403645 | 415840 | +3.0% | |short |1061663 |3003942 |+183.0% |1040392 |2834437 |+172.4% | +------------+--------+--------+--------+--------+--------+--------+ There are cases where the performance drops 1-2%, but the majority of them are performance gains and some of them are pretty significant. For the disk workload, the spinlock bottleneck is the standalone mb_cache_spinlock in fs/mbcache.c. For the short workload, the spinlock bottleneck is the d_lock in the dentry structure. The following table shows the %time used up by the locks as reported by the perf command at 1000 users for disk workload & 1500 users for short workload with HT off. 
-----------------+----------+----------+----------+ Lock | ticket | qlock | qlock | | spinlock | fastpath | slowpath | -----------------+----------+----------+----------+ Disk w/o patch | 69.0% | - | - | Disk with patch | - | 0.31% | 73.6% | Short w/o patch | 74.3% | - | - | Short with patch | - | 0.72% | 72.3% | -----------------+----------+----------+----------+ There are 2 observations here: 1. The %CPU time actually increases in the disk workload with the patch. The fact that most of them are spinning on their own cacheline with less congestion can probably skew the number up. 2. Both the short and disk workloads spend about 70% of their time in the lock. However, performance improvement is 2.8X vs 1.3X. This is probably due to the fact that d_lock is embedded next to the d_count to be updated whereas mb_cache_spinlock is in a standalone cacheline not shared by the data to be updated. Signed-off-by: Waiman Long <Waiman.Long@hp.com> --- include/asm-generic/qspinlock.h | 120 ++++++++++++++++++++ lib/Kconfig | 14 +++ lib/Makefile | 1 + lib/qspinlock.c | 237 +++++++++++++++++++++++++++++++++++++++ 4 files changed, 372 insertions(+), 0 deletions(-) create mode 100644 include/asm-generic/qspinlock.h create mode 100644 lib/qspinlock.c diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h new file mode 100644 index 0000000..4cfb369 --- /dev/null +++ b/include/asm-generic/qspinlock.h @@ -0,0 +1,120 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * (C) Copyright 2013 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long <waiman.long@hp.com> + */ +#ifndef __ASM_GENERIC_QSPINLOCK_H +#define __ASM_GENERIC_QSPINLOCK_H + +#include <linux/types.h> +#include <asm/cmpxchg.h> +#include <asm/barrier.h> +#include <asm/processor.h> + +#ifdef CONFIG_PARAVIRT_SPINLOCKS +#define qspinlock arch_spinlock +#endif + +/* + * The queue spinlock data structure + */ +typedef struct qspinlock { + union { + struct { + u8 locked; /* Bit lock */ + u8 reserved; + u16 qcode; /* Wait queue code */ + }; + u32 qlock; + }; +} arch_spinlock_t; + +/* + * External function declarations + */ +extern void queue_spin_lock_slowpath(struct qspinlock *lock); + +/** + * queue_spin_is_locked - is the spinlock locked? + * @lock: Pointer to queue spinlock structure + * Return: 1 if it is locked, 0 otherwise + */ +static __always_inline int queue_spin_is_locked(struct qspinlock *lock) +{ + return ACCESS_ONCE(lock->locked); +} + +/** + * queue_spin_is_contended - check if the lock is contended + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock contended, 0 otherwise + */ +static __always_inline int queue_spin_is_contended(struct qspinlock *lock) +{ + return ACCESS_ONCE(lock->qcode); +} +/** + * queue_spin_trylock - try to acquire the queue spinlock + * @lock : Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int queue_spin_trylock(struct qspinlock *lock) +{ + if (!queue_spin_is_contended(lock) && (xchg(&lock->locked, 1) == 0)) + return 1; + return 0; +} + +/** + * queue_spin_lock - acquire a queue spinlock + * @lock: Pointer to queue spinlock structure + */ +static __always_inline void queue_spin_lock(struct qspinlock *lock) +{ + if (likely(queue_spin_trylock(lock))) + return; + queue_spin_lock_slowpath(lock); +} + +/** + * queue_spin_unlock - release a queue spinlock + * @lock : Pointer to queue spinlock structure + */ +static __always_inline void 
queue_spin_unlock(struct qspinlock *lock) +{ + barrier(); + ACCESS_ONCE(lock->locked) = 0; + smp_wmb(); +} + +/* + * Initializier + */ +#define __ARCH_SPIN_LOCK_UNLOCKED { { .qlock = 0 } } + +#ifndef CONFIG_PARAVIRT_SPINLOCKS +/* + * Remapping spinlock architecture specific functions to the corresponding + * queue spinlock functions. + */ +#define arch_spin_is_locked(l) queue_spin_is_locked(l) +#define arch_spin_is_contended(l) queue_spin_is_contended(l) +#define arch_spin_lock(l) queue_spin_lock(l) +#define arch_spin_trylock(l) queue_spin_trylock(l) +#define arch_spin_unlock(l) queue_spin_unlock(l) +#define arch_spin_lock_flags(l, f) queue_spin_lock(l) +#endif + +#endif /* __ASM_GENERIC_QSPINLOCK_H */ diff --git a/lib/Kconfig b/lib/Kconfig index 35da513..15e0e02 100644 --- a/lib/Kconfig +++ b/lib/Kconfig @@ -412,6 +412,20 @@ config SIGNATURE Implementation is done using GnuPG MPI library # +# Generic queue spinlock +# +config QUEUE_SPINLOCK + bool "Generic queue spinlock" + depends on ARCH_QUEUE_SPINLOCK + default n + help + Use an alternative spinlock implementation that has the same + spinlock structure size but is more optimized for larger + NUMA systems with a lot of CPU cores. Specifically, waiting + lock spinners are put to wait in a queue on a local cacheline + rather than all spinning on the same lock cacheline. + +# # libfdt files, only selected if needed. 
# config LIBFDT diff --git a/lib/Makefile b/lib/Makefile index 7baccfd..b53710a 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -187,3 +187,4 @@ quiet_cmd_build_OID_registry = GEN $@ clean-files += oid_registry_data.c obj-$(CONFIG_UCS2_STRING) += ucs2_string.o +obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o diff --git a/lib/qspinlock.c b/lib/qspinlock.c new file mode 100644 index 0000000..21d5f6e --- /dev/null +++ b/lib/qspinlock.c @@ -0,0 +1,237 @@ +/* + * Queue spinlock + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * (C) Copyright 2013 Hewlett-Packard Development Company, L.P. + * + * Authors: Waiman Long <waiman.long@hp.com> + */ +#include <linux/smp.h> +#include <linux/bug.h> +#include <linux/cpumask.h> +#include <linux/percpu.h> +#include <linux/hardirq.h> +#include <asm-generic/qspinlock.h> + +/* + * The queue spinlock fastpath is as simple as it can get, all the heavy + * lifting is done in the lock slowpath. The main idea behind this queue + * spinlock implementation is to keep the spinlock size at 4 bytes while + * at the same time implement a queue structure to queue up the waiting + * lock spinners. + * + * Since preemption is disabled before getting the lock, a given CPU will + * only need to use one queue node structure in a non-interrupt context. + * A percpu queue node structure will be allocated for this purpose and the + * cpu number will be put into the queue spinlock structure to indicate the + * tail of the queue. 
+ * + * To handle spinlock acquisition at interrupt context (softirq or hardirq), + * the queue node structure is actually an array for supporting nested spin + * locking operations in interrupt handlers. If all the entries in the + * array are used up, a warning message will be printed (as that shouldn't + * happen in normal circumstances) and the lock spinner will fall back to + * busy spinning instead of waiting in a queue. + */ + +/* +#ifndef CONFIG_DEBUG_SPINLOCK +#define CONFIG_DEBUG_SPINLOCK 1 +#endif + */ + +/* + * The queue node structure + */ +struct qnode { + struct qnode *next; + u8 wait; /* Waiting flag */ + u8 used; /* Used flag */ +#ifdef CONFIG_DEBUG_SPINLOCK + u16 cpu_nr; /* CPU number */ + void *lock; /* Lock address */ +#endif +}; + +/* + * The 16-bit wait queue code is divided into the following 2 fields: + * Bits 0-1 : queue node index + * Bits 2-15: cpu number + 1 + * + * The current implementation will allow a maximum of (1<<14)-1 = 16383 CPUs. + */ +#define GET_QN_IDX(code) ((code) & 0x3) +#define GET_CPU_NR(code) (((code) >> 2) - 1) +#define SET_CPU_CODE(cpu, idx) ((((cpu) + 1) << 2) | idx) +#define MAX_QNODES 4 +#define MAX_CPUS ((1<<14)-1) + +/* + * Per-CPU queue node structures + */ +static DEFINE_PER_CPU(struct qnode [MAX_QNODES], qnodes) = { { 0 } }; + +/** + * queue_trylock - try to acquire the lock bit ignoring the qcode in lock + * @lock: Pointer to queue spinlock structure + * Return: 1 if lock acquired, 0 if failed + */ +static __always_inline int queue_trylock(struct qspinlock *lock) +{ + if (!ACCESS_ONCE(lock->locked) && (xchg(&lock->locked, 1) == 0)) + return 1; + return 0; +} + +/** + * queue_spin_lock_slowpath - acquire the queue spinlock + * @lock: Pointer to queue spinlock structure + */ +void queue_spin_lock_slowpath(struct qspinlock *lock) +{ + unsigned int cpu_nr, qn_idx; + struct qnode *node, *prev, *next; + u16 oldcode, newcode; + + /* + * The current code does not work with more than 16K-1 cpus. 
+ */ + BUILD_BUG_ON(CONFIG_NR_CPUS >= MAX_CPUS); + + /* + * Get the queue node + */ + cpu_nr = smp_processor_id(); + node = per_cpu_ptr(&qnodes[0], cpu_nr); + qn_idx = 0; + + if (unlikely(node->used)) { + /* + * The node has been used, try to find an empty queue node entry + */ + for (qn_idx = 1; qn_idx < MAX_QNODES; qn_idx++) + if (!node[qn_idx].used) + break; + if (unlikely(qn_idx == MAX_QNODES)) { + /* + * This shouldn't happen; print a warning message + * and busy-spin on the lock. + */ + pr_warn("qspinlock: queue node table exhausted at " + "cpu %d!\n", cpu_nr); + while (!queue_trylock(lock)) + cpu_relax(); + return; + } + /* Adjust node pointer */ + node += qn_idx; + } + + /* + * Set up the new cpu code to be exchanged + */ + newcode = SET_CPU_CODE(cpu_nr, qn_idx); + + /* + * The lock may be available at this point, try again before waiting + * in a queue. + */ + if (queue_spin_trylock(lock)) + return; + + /* + * Set up the queue node + */ +#ifdef CONFIG_DEBUG_SPINLOCK + BUG_ON(node->used); +#endif + node->used = true; + barrier(); +#ifdef CONFIG_DEBUG_SPINLOCK + BUG_ON(node->lock || node->next || node->wait); + node->cpu_nr = cpu_nr; + node->lock = (void *)lock; +#endif + node->wait = true; + node->next = NULL; + + /* + * Exchange current copy of the cpu code + */ + oldcode = xchg(&lock->qcode, newcode); + if (oldcode) { + /* + * Not at the queue head, get the address of the previous node + * and set up the "next" pointer of that node. + */ + prev = per_cpu_ptr(&qnodes[GET_QN_IDX(oldcode)], + GET_CPU_NR(oldcode)); +#ifdef CONFIG_DEBUG_SPINLOCK + BUG_ON(prev->cpu_nr == cpu_nr); + BUG_ON(prev->lock != (void *)lock); + BUG_ON(!prev->used || (prev->next != NULL)); +#endif + ACCESS_ONCE(prev->next) = node; + smp_wmb(); + /* + * Wait until the waiting flag is off + */ + while (ACCESS_ONCE(node->wait)) + cpu_relax(); + } + + /* + * At the head of the wait queue now, try to get the lock in a + * spin loop. 
+ */ + while (!queue_trylock(lock)) + cpu_relax(); + + next = ACCESS_ONCE(node->next); + if (next) + goto notify_next; + /* + * Clear the wait queue if it is the last node + */ + if ((ACCESS_ONCE(lock->qcode) == newcode) && + (cmpxchg(&lock->qcode, newcode, 0) == newcode)) + goto release_qnode; + /* + * Wait until the next one in queue set up the next field + */ + while (!(next = ACCESS_ONCE(node->next))) + cpu_relax(); +notify_next: +#ifdef CONFIG_DEBUG_SPINLOCK + BUG_ON(next->cpu_nr == cpu_nr); + BUG_ON(next->lock != (void *)lock); + BUG_ON(!next->wait || !next->used); +#endif + /* + * The next one in queue is now at the head + */ + ACCESS_ONCE(next->wait) = false; + +release_qnode: + /* + * Release the queue node + */ + node->next = NULL; + node->used = false; + node->wait = false; +#ifdef CONFIG_DEBUG_SPINLOCK + node->lock = NULL; + BUG_ON(cpu_nr != smp_processor_id()); +#endif + smp_wmb(); +} +EXPORT_SYMBOL(queue_spin_lock_slowpath); -- 1.7.1 ^ permalink raw reply related [flat|nested] 26+ messages in thread
[parent not found: <20130801094029.GK3008@twins.programming.kicks-ass.net>]
* Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation [not found] ` <20130801094029.GK3008@twins.programming.kicks-ass.net> @ 2013-08-01 10:11 ` Raghavendra K T 2013-08-01 10:11 ` Raghavendra K T ` (2 more replies) [not found] ` <51FAA1C3.2050507@hp.com> 1 sibling, 3 replies; 26+ messages in thread From: Raghavendra K T @ 2013-08-01 10:11 UTC (permalink / raw) To: Peter Zijlstra Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton, Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds, George Spelvin, Harvey Harrison <harvey.ha> On 08/01/2013 03:10 PM, Peter Zijlstra wrote: > On Wed, Jul 31, 2013 at 10:37:10PM -0400, Waiman Long wrote: > > OK, so over-all I rather like the thing. It might be good to include a > link to some MCS lock description, sadly wikipedia doesn't have an > article on the concept :/ > > http://www.cise.ufl.edu/tr/DOC/REP-1992-71.pdf > > That seems like nice (short-ish) write-up of the general algorithm. > >> +typedef struct qspinlock { >> + union { >> + struct { >> + u8 locked; /* Bit lock */ >> + u8 reserved; >> + u16 qcode; /* Wait queue code */ >> + }; >> + u32 qlock; >> + }; >> +} arch_spinlock_t; > >> +static __always_inline void queue_spin_unlock(struct qspinlock *lock) >> +{ >> + barrier(); >> + ACCESS_ONCE(lock->locked) = 0; > > Its always good to add comments with barriers.. 
> >> + smp_wmb(); >> +} > >> +/* >> + * The queue node structure >> + */ >> +struct qnode { >> + struct qnode *next; >> + u8 wait; /* Waiting flag */ >> + u8 used; /* Used flag */ >> +#ifdef CONFIG_DEBUG_SPINLOCK >> + u16 cpu_nr; /* CPU number */ >> + void *lock; /* Lock address */ >> +#endif >> +}; >> + >> +/* >> + * The 16-bit wait queue code is divided into the following 2 fields: >> + * Bits 0-1 : queue node index >> + * Bits 2-15: cpu number + 1 >> + * >> + * The current implementation will allow a maximum of (1<<14)-1 = 16383 CPUs. > > I haven't yet read far enough to figure out why you need the -1 thing, > but effectively you're restricted to 15k due to this. > It is exactly 16k-1 not 15k That is because CPU_CODE of 1 to 16k represents cpu 0..16k-1 ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation 2013-08-01 10:11 ` Raghavendra K T 2013-08-01 10:11 ` Raghavendra K T @ 2013-08-01 10:12 ` Peter Zijlstra 2013-08-01 10:12 ` Peter Zijlstra 2013-08-01 10:14 ` Peter Zijlstra 2 siblings, 1 reply; 26+ messages in thread From: Peter Zijlstra @ 2013-08-01 10:12 UTC (permalink / raw) To: Raghavendra K T Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton, Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds, George Spelvin, Harvey Harrison <harvey.ha> On Thu, Aug 01, 2013 at 03:41:33PM +0530, Raghavendra K T wrote: > It is exactly 16k-1 not 15k > That is because CPU_CODE of 1 to 16k represents cpu 0..16k-1 From what I know big systems are usually build with power-of-two factors. Although I suppose with a ring fabric you could have an arbitrary number of nodes. Anyway, I've heard SGI talk about 4K cpu systems, 8K cpu systems and 16K cpu systems, I've not heard them talk about 16K-n systems. Also, as in other parts of the reply I send, this limitation seems completely unnecessary. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation 2013-08-01 10:11 ` Raghavendra K T 2013-08-01 10:11 ` Raghavendra K T 2013-08-01 10:12 ` Peter Zijlstra @ 2013-08-01 10:14 ` Peter Zijlstra 2013-08-01 10:14 ` Peter Zijlstra 2 siblings, 1 reply; 26+ messages in thread From: Peter Zijlstra @ 2013-08-01 10:14 UTC (permalink / raw) To: Raghavendra K T Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton, Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds, George Spelvin, Harvey Harrison <harvey.ha> On Thu, Aug 01, 2013 at 03:41:33PM +0530, Raghavendra K T wrote: > It is exactly 16k-1 not 15k > That is because CPU_CODE of 1 to 16k represents cpu 0..16k-1 > More specifically: + BUILD_BUG_ON(CONFIG_NR_CPUS >= MAX_CPUS); Our NR_CPUS is very much a power of two. ^ permalink raw reply [flat|nested] 26+ messages in thread
[parent not found: <51FAA1C3.2050507@hp.com>]
* Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation [not found] ` <51FAA1C3.2050507@hp.com> @ 2013-08-01 18:16 ` Raghavendra K T 2013-08-01 18:16 ` Raghavendra K T 2013-08-01 20:10 ` Peter Zijlstra 0 siblings, 2 replies; 26+ messages in thread From: Raghavendra K T @ 2013-08-01 18:16 UTC (permalink / raw) To: Waiman Long Cc: Peter Zijlstra, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton, Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds, George Spelvin, Harvey Harrison <harv> On 08/01/2013 11:28 PM, Waiman Long wrote: > On 08/01/2013 05:40 AM, Peter Zijlstra wrote: >> On Wed, Jul 31, 2013 at 10:37:10PM -0400, Waiman Long wrote: >> [...] >> >>> + */ >>> + for (qn_idx = 1; qn_idx< MAX_QNODES; qn_idx++) { >>> + if (!node[qn_idx].used) >>> + break; >> } >> >>> + if (unlikely(qn_idx == MAX_QNODES)) { >>> + /* >>> + * This shouldn't happen, print a warning message >>> + *& busy spinning on the lock. >>> + */ >>> + pr_warn("qspinlock: queue node table exhausted at " >>> + "cpu %d!\n", cpu_nr); >> This could make your machine die hard.. not all contexts can printk(). > > Do you have any suggestion? I could skip the warning and silently do the > busy spinning. I just want some way to notify the user of this rare event. We have used debugfs in pv-spinlock to avoid that since printk uses spinlock again. may be it will help to profile many other parts of code too. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation 2013-08-01 18:16 ` Raghavendra K T 2013-08-01 18:16 ` Raghavendra K T @ 2013-08-01 20:10 ` Peter Zijlstra 2013-08-01 20:10 ` Peter Zijlstra 2013-08-01 20:36 ` Raghavendra K T 1 sibling, 2 replies; 26+ messages in thread From: Peter Zijlstra @ 2013-08-01 20:10 UTC (permalink / raw) To: Raghavendra K T Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton, Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds, George Spelvin, Harvey Harrison <harvey.ha> On Thu, Aug 01, 2013 at 11:46:24PM +0530, Raghavendra K T wrote: > On 08/01/2013 11:28 PM, Waiman Long wrote: > >On 08/01/2013 05:40 AM, Peter Zijlstra wrote: > >>On Wed, Jul 31, 2013 at 10:37:10PM -0400, Waiman Long wrote: > >> > [...] > >> > >>>+ */ > >>>+ for (qn_idx = 1; qn_idx< MAX_QNODES; qn_idx++) { > >>>+ if (!node[qn_idx].used) > >>>+ break; > >> } > >> > >>>+ if (unlikely(qn_idx == MAX_QNODES)) { > >>>+ /* > >>>+ * This shouldn't happen, print a warning message > >>>+ *& busy spinning on the lock. > >>>+ */ > >>>+ pr_warn("qspinlock: queue node table exhausted at " > >>>+ "cpu %d!\n", cpu_nr); > >>This could make your machine die hard.. not all contexts can printk(). > > > >Do you have any suggestion? I could skip the warning and silently do the > >busy spinning. I just want some way to notify the user of this rare event. > > We have used debugfs in pv-spinlock to avoid that since printk uses > spinlock again. may be it will help to profile many other parts of > code too. I always use early_printk(), but that requires you set up your serial console properly and joe-user won't have done that. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation 2013-08-01 20:10 ` Peter Zijlstra 2013-08-01 20:10 ` Peter Zijlstra @ 2013-08-01 20:36 ` Raghavendra K T 2013-08-01 20:36 ` Raghavendra K T 1 sibling, 1 reply; 26+ messages in thread From: Raghavendra K T @ 2013-08-01 20:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton, Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds, George Spelvin, Harvey Harrison <harvey.ha> On 08/02/2013 01:40 AM, Peter Zijlstra wrote: > On Thu, Aug 01, 2013 at 11:46:24PM +0530, Raghavendra K T wrote: >> On 08/01/2013 11:28 PM, Waiman Long wrote: >>> On 08/01/2013 05:40 AM, Peter Zijlstra wrote: >>>> On Wed, Jul 31, 2013 at 10:37:10PM -0400, Waiman Long wrote: >>>> >> [...] >>>> >>>>> + */ >>>>> + for (qn_idx = 1; qn_idx< MAX_QNODES; qn_idx++) { >>>>> + if (!node[qn_idx].used) >>>>> + break; >>>> } >>>> >>>>> + if (unlikely(qn_idx == MAX_QNODES)) { >>>>> + /* >>>>> + * This shouldn't happen, print a warning message >>>>> + *& busy spinning on the lock. >>>>> + */ >>>>> + pr_warn("qspinlock: queue node table exhausted at " >>>>> + "cpu %d!\n", cpu_nr); >>>> This could make your machine die hard.. not all contexts can printk(). >>> >>> Do you have any suggestion? I could skip the warning and silently do the >>> busy spinning. I just want some way to notify the user of this rare event. >> >> We have used debugfs in pv-spinlock to avoid that since printk uses >> spinlock again. may be it will help to profile many other parts of >> code too. > > I always use early_printk(), but that requires you set up your serial > console properly and joe-user won't have done that. > Thanks Peter. Yes. this is more convenient. 
/me remembers using early_printk during pvops patch debugging without setting up serial console ;) ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation 2013-08-01 2:37 ` [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long 2013-08-01 2:37 ` Waiman Long [not found] ` <20130801094029.GK3008@twins.programming.kicks-ass.net> @ 2013-08-01 20:23 ` Raghavendra K T 2013-08-01 20:23 ` Raghavendra K T ` (2 more replies) 2 siblings, 3 replies; 26+ messages in thread From: Raghavendra K T @ 2013-08-01 20:23 UTC (permalink / raw) To: Waiman Long Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt, Andrew Morton, Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds, George Spelvin On 08/01/2013 08:07 AM, Waiman Long wrote: > This patch introduces a new queue spinlock implementation that can > serve as an alternative to the default ticket spinlock. Compared > with the ticket spinlock, this queue spinlock should be as fair as > the ticket spinlock. It is faster both in single-thread and in high > contention situations. Only in light to moderate contention where > the average queue depth is around 1-2 will this queue spinlock be > potentially a bit slower due to the higher slowpath overhead. > > This queue spinlock is especially suit to NUMA machines with a large > number of cores as the chance of spinlock contention is much higher > in those machines. > > The idea behind this spinlock implementation is the fact that spinlocks > are acquired with preemption disabled. In other words, the process > will not be migrated to another CPU while it is trying to get a > spinlock. Ignoring interrupt handling, a CPU can only be contending > in one spinlock at any one time. 
Of course, interrupt handler can try > to acquire one spinlock while the interrupted user process is in the > process of getting another spinlock. By allocating a set of per-cpu > queue nodes and used them to form a waiting queue, we can encode the > queue node address into a much smaller 16-bit size. Together with > the 1-byte lock bit, this queue spinlock implementation will only > need 4 bytes to hold all the information that it needs. > > The current queue node address encoding is as follows: > Bits 0-1 : queue node index in the per-cpu array (4 entries) > Bits 2-16: cpu number + 1 (max cpus = 16383) > > By using also the unused byte, we could support more cpus and queue > node entries in the array at the expense of more coding and probably > performance overheads. > > In the extremely unlikely case that all the queue node entries are > used up, the current code will fall back to busy spinning without > waiting in a queue with warning message. > > With minor modification to the lock fastpath, we could also make a > less fair version of queue spinlock that allows lock stealing. This > can potentially provide better performance for some workloads. > > This patch allows the optional replacement of architecture specific > implementation of ticket spinlock by this generic version of queue > spinlock. Two new config parameters are introduced: > > 1. QUEUE_SPINLOCK > A select-able option that enables the building and replacement of > architecture specific spinlock by queue spinlock. > 2. ARCH_QUEUE_SPINLOCK > Have to be defined in arch/$(ARCH)/Kconfig to enable QUEUE_SPINLOCK > option. This option, by itself, will not enable the queue spinlock > feature. > > For single-thread performance (no contention), a 256K lock/unlock > loop was run on a 2.93Ghz Westmere x86-64 CPU. 
The following table > shows the average time (in ns) for a single lock/unlock sequence > (including the looping and timing overhead): > > Lock Type Time (ns) > --------- --------- > Ticket spinlock 12.1 > Queue spinlock 10.9 > > So the queue spinlock is slightly faster. > > The AIM7 benchmark was run on a 8-socket 80-core DL980 with Westmere > x86-64 CPUs to evaluate the performance impact of this patch on the > 3.10.1 kernel. The jobs per minute (JPM) results for the 1100-2000 > user ranges are shown below: > > +------------+--------+--------+--------+--------+--------+--------+ > | Kernel | 3.10.1 | 3.10.1 |%Change | 3.10.1 | 3.10.1 |%Change | > | | | qlock | | | qlock | | > | HT | off | off | | on | on | | > +------------+--------+--------+--------+--------+--------+--------+ > |alltests | 403590 | 397622 | -1.5% | 411479 | 411428 | 0.0% | > |custom | 304581 | 307165 | +0.9% | 230335 | 254798 | +10.6% | > |dbase | 896252 | 900597 | +0.5% |1134659 |1137730 | +0.3% | > |disk | 213679 | 270597 | +26.6% | 203974 | 249998 | +22.6% | > |five_sec | 144012 | 157286 | +9.2% | 132599 | 150812 | +13.7% | > |fserver | 472741 | 467213 | -1.2% | 270601 | 285737 | +5.6% | > |high_systime| 131106 | 139015 | +6.0% | 120294 | 125802 | +4.6% | > |new_dbase | 934871 | 936860 | +0.2 |1177640 |1181679 | +0.3% | > |new_fserver | 452544 | 444967 | -1.7% | 262368 | 275827 | +5.1% | > |shared | 366750 | 369901 | +0.9% | 403645 | 415840 | +3.0% | > |short |1061663 |3003942 |+183.0% |1040392 |2834437 |+172.4% | > +------------+--------+--------+--------+--------+--------+--------+ > > There are cases where the performance drops 1-2%, but the majority > of them are performance gains and some of them are pretty significant. > > For the disk workload, the spinlock bottleneck is the standalone > mb_cache_spinlock in fs/mbcache.c. For the short workload, the spinlock > bottleneck is the d_lock in the dentry structure. 
> > The following table shows the %time used up by the locks as reported > by the perf command at 1000 users for disk workload & 1500 users for > short workload with HT off. > > -----------------+----------+----------+----------+ > Lock | ticket | qlock | qlock | > | spinlock | fastpath | slowpath | > -----------------+----------+----------+----------+ > Disk w/o patch | 69.0% | - | - | > Disk with patch | - | 0.31% | 73.6% | > Short w/o patch | 74.3% | - | - | > Short with patch | - | 0.72% | 72.3% | > -----------------+----------+----------+----------+ > > There are 2 observations here: > 1. The %CPU time actually increases in the disk workload with the > patch. The fact that most of them are spinning on their own > cacheline with less congestion can probably skew the number up. > > 2. Both the short and disk workloads spend about 70% of their time > in the lock. However, performance improvement is 2.8X vs 1.3X. > This is probably due to the fact that d_lock is embedded next > to the d_count to be updated whereas mb_cache_spinlock is in a > standalone cacheline not shared by the data to be updated. > > Signed-off-by: Waiman Long <Waiman.Long@hp.com> > --- > include/asm-generic/qspinlock.h | 120 ++++++++++++++++++++ > lib/Kconfig | 14 +++ > lib/Makefile | 1 + > lib/qspinlock.c | 237 +++++++++++++++++++++++++++++++++++++++ > 4 files changed, 372 insertions(+), 0 deletions(-) > create mode 100644 include/asm-generic/qspinlock.h > create mode 100644 lib/qspinlock.c > > diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h > new file mode 100644 > index 0000000..4cfb369 > --- /dev/null > +++ b/include/asm-generic/qspinlock.h > @@ -0,0 +1,120 @@ > +/* > + * Queue spinlock > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + * (at your option) any later version. 
> + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * (C) Copyright 2013 Hewlett-Packard Development Company, L.P. > + * > + * Authors: Waiman Long <waiman.long@hp.com> > + */ > +#ifndef __ASM_GENERIC_QSPINLOCK_H > +#define __ASM_GENERIC_QSPINLOCK_H > + > +#include <linux/types.h> > +#include <asm/cmpxchg.h> > +#include <asm/barrier.h> > +#include <asm/processor.h> > + > +#ifdef CONFIG_PARAVIRT_SPINLOCKS > +#define qspinlock arch_spinlock > +#endif > + > +/* > + * The queue spinlock data structure > + */ > +typedef struct qspinlock { > + union { > + struct { > + u8 locked; /* Bit lock */ > + u8 reserved; > + u16 qcode; /* Wait queue code */ > + }; > + u32 qlock; > + }; > +} arch_spinlock_t; > + > +/* > + * External function declarations > + */ > +extern void queue_spin_lock_slowpath(struct qspinlock *lock); > + > +/** > + * queue_spin_is_locked - is the spinlock locked? 
> + * @lock: Pointer to queue spinlock structure > + * Return: 1 if it is locked, 0 otherwise > + */ > +static __always_inline int queue_spin_is_locked(struct qspinlock *lock) > +{ > + return ACCESS_ONCE(lock->locked); > +} > + > +/** > + * queue_spin_is_contended - check if the lock is contended > + * @lock : Pointer to queue spinlock structure > + * Return: 1 if lock contended, 0 otherwise > + */ > +static __always_inline int queue_spin_is_contended(struct qspinlock *lock) > +{ > + return ACCESS_ONCE(lock->qcode); > +} > +/** > + * queue_spin_trylock - try to acquire the queue spinlock > + * @lock : Pointer to queue spinlock structure > + * Return: 1 if lock acquired, 0 if failed > + */ > +static __always_inline int queue_spin_trylock(struct qspinlock *lock) > +{ > + if (!queue_spin_is_contended(lock) && (xchg(&lock->locked, 1) == 0)) > + return 1; > + return 0; > +} > + > +/** > + * queue_spin_lock - acquire a queue spinlock > + * @lock: Pointer to queue spinlock structure > + */ > +static __always_inline void queue_spin_lock(struct qspinlock *lock) > +{ > + if (likely(queue_spin_trylock(lock))) > + return; > + queue_spin_lock_slowpath(lock); > +} Quickly falling into the slowpath may hurt performance in some cases, no? Instead, I tried something like this: #define SPIN_THRESHOLD 64 static __always_inline void queue_spin_lock(struct qspinlock *lock) { unsigned count = SPIN_THRESHOLD; do { if (likely(queue_spin_trylock(lock))) return; cpu_relax(); } while (count--); queue_spin_lock_slowpath(lock); } Though I could see some gains in overcommit, it hurt undercommit in some workloads :(. 
> + > +/** > + * queue_spin_unlock - release a queue spinlock > + * @lock : Pointer to queue spinlock structure > + */ > +static __always_inline void queue_spin_unlock(struct qspinlock *lock) > +{ > + barrier(); > + ACCESS_ONCE(lock->locked) = 0; > + smp_wmb(); > +} > + > +/* > + * Initializer > + */ > +#define __ARCH_SPIN_LOCK_UNLOCKED { { .qlock = 0 } } > + > +#ifndef CONFIG_PARAVIRT_SPINLOCKS > +/* > + * Remapping spinlock architecture specific functions to the corresponding > + * queue spinlock functions. > + */ > +#define arch_spin_is_locked(l) queue_spin_is_locked(l) > +#define arch_spin_is_contended(l) queue_spin_is_contended(l) > +#define arch_spin_lock(l) queue_spin_lock(l) > +#define arch_spin_trylock(l) queue_spin_trylock(l) > +#define arch_spin_unlock(l) queue_spin_unlock(l) > +#define arch_spin_lock_flags(l, f) queue_spin_lock(l) > +#endif > + > +#endif /* __ASM_GENERIC_QSPINLOCK_H */ > diff --git a/lib/Kconfig b/lib/Kconfig > index 35da513..15e0e02 100644 > --- a/lib/Kconfig > +++ b/lib/Kconfig > @@ -412,6 +412,20 @@ config SIGNATURE > Implementation is done using GnuPG MPI library > > # > +# Generic queue spinlock > +# > +config QUEUE_SPINLOCK > + bool "Generic queue spinlock" > + depends on ARCH_QUEUE_SPINLOCK > + default n > + help > + Use an alternative spinlock implementation that has the same > + spinlock structure size but is more optimized for larger > + NUMA systems with a lot of CPU cores. Specifically, waiting > + lock spinners are put to wait in a queue on a local cacheline > + rather than all spinning on the same lock cacheline. > + > +# > # libfdt files, only selected if needed. 
> # > config LIBFDT > diff --git a/lib/Makefile b/lib/Makefile > index 7baccfd..b53710a 100644 > --- a/lib/Makefile > +++ b/lib/Makefile > @@ -187,3 +187,4 @@ quiet_cmd_build_OID_registry = GEN $@ > clean-files += oid_registry_data.c > > obj-$(CONFIG_UCS2_STRING) += ucs2_string.o > +obj-$(CONFIG_QUEUE_SPINLOCK) += qspinlock.o > diff --git a/lib/qspinlock.c b/lib/qspinlock.c > new file mode 100644 > index 0000000..21d5f6e > --- /dev/null > +++ b/lib/qspinlock.c > @@ -0,0 +1,237 @@ > +/* > + * Queue spinlock > + * > + * This program is free software; you can redistribute it and/or modify > + * it under the terms of the GNU General Public License as published by > + * the Free Software Foundation; either version 2 of the License, or > + * (at your option) any later version. > + * > + * This program is distributed in the hope that it will be useful, > + * but WITHOUT ANY WARRANTY; without even the implied warranty of > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the > + * GNU General Public License for more details. > + * > + * (C) Copyright 2013 Hewlett-Packard Development Company, L.P. > + * > + * Authors: Waiman Long <waiman.long@hp.com> > + */ > +#include <linux/smp.h> > +#include <linux/bug.h> > +#include <linux/cpumask.h> > +#include <linux/percpu.h> > +#include <linux/hardirq.h> > +#include <asm-generic/qspinlock.h> > + > +/* > + * The queue spinlock fastpath is as simple as it can get, all the heavy > + * lifting is done in the lock slowpath. The main idea behind this queue > + * spinlock implementation is to keep the spinlock size at 4 bytes while > + * at the same time implement a queue structure to queue up the waiting > + * lock spinners. > + * > + * Since preemption is disabled before getting the lock, a given CPU will > + * only need to use one queue node structure in a non-interrupt context. 
> + * A percpu queue node structure will be allocated for this purpose and the > + * cpu number will be put into the queue spinlock structure to indicate the > + * tail of the queue. > + * > + * To handle spinlock acquisition at interrupt context (softirq or hardirq), > + * the queue node structure is actually an array for supporting nested spin > + * locking operations in interrupt handlers. If all the entries in the > + * array are used up, a warning message will be printed (as that shouldn't > + * happen in normal circumstances) and the lock spinner will fall back to > + * busy spinning instead of waiting in a queue. > + */ > + > +/* > +#ifndef CONFIG_DEBUG_SPINLOCK > +#define CONFIG_DEBUG_SPINLOCK 1 > +#endif > + */ > + > +/* > + * The queue node structure > + */ > +struct qnode { > + struct qnode *next; > + u8 wait; /* Waiting flag */ > + u8 used; /* Used flag */ > +#ifdef CONFIG_DEBUG_SPINLOCK > + u16 cpu_nr; /* CPU number */ > + void *lock; /* Lock address */ > +#endif > +}; > + > +/* > + * The 16-bit wait queue code is divided into the following 2 fields: > + * Bits 0-1 : queue node index > + * Bits 2-15: cpu number + 1 > + * > + * The current implementation will allow a maximum of (1<<14)-1 = 16383 CPUs. 
> + */ > +#define GET_QN_IDX(code) ((code) & 0x3) > +#define GET_CPU_NR(code) (((code) >> 2) - 1) > +#define SET_CPU_CODE(cpu, idx) ((((cpu) + 1) << 2) | idx) > +#define MAX_QNODES 4 > +#define MAX_CPUS ((1<<14)-1) > + > +/* > + * Per-CPU queue node structures > + */ > +static DEFINE_PER_CPU(struct qnode [MAX_QNODES], qnodes) = { { 0 } }; > + > +/** > + * queue_trylock - try to acquire the lock bit ignoring the qcode in lock > + * @lock: Pointer to queue spinlock structure > + * Return: 1 if lock acquired, 0 if failed > + */ > +static __always_inline int queue_trylock(struct qspinlock *lock) > +{ > + if (!ACCESS_ONCE(lock->locked) && (xchg(&lock->locked, 1) == 0)) > + return 1; > + return 0; > +} It took me a long time to confirm that this is only used when we exhaust all the queue nodes. But I am not sure of a better name that would not be confused with queue_spin_trylock. Anyway, they are in different files :). > + > +/** > + * queue_spin_lock_slowpath - acquire the queue spinlock > + * @lock: Pointer to queue spinlock structure > + */ > +void queue_spin_lock_slowpath(struct qspinlock *lock) > +{ > + unsigned int cpu_nr, qn_idx; > + struct qnode *node, *prev, *next; > + u16 oldcode, newcode; > + > + /* > + * The current code does not work with more than 16K-1 cpus. > + */ > + BUILD_BUG_ON(CONFIG_NR_CPUS >= MAX_CPUS); > + > + /* > + * Get the queue node > + */ > + cpu_nr = smp_processor_id(); > + node = per_cpu_ptr(&qnodes[0], cpu_nr); > + qn_idx = 0; > + > + if (unlikely(node->used)) { > + /* > + * The node has been used, try to find an empty queue node entry > + */ > + for (qn_idx = 1; qn_idx < MAX_QNODES; qn_idx++) > + if (!node[qn_idx].used) > + break; > + if (unlikely(qn_idx == MAX_QNODES)) { > + /* > + * This shouldn't happen, print a warning message > + * & busy spin on the lock. 
> + */ > + pr_warn("qspinlock: queue node table exhausted at " > + "cpu %d!\n", cpu_nr);

As discussed, maybe debugfs could help in debugging the code, as well as in profiling (how many locks are taken in the slowpath, etc.), so that we can get rid of the printks.

Result: Sandy Bridge 32-cpu / 16-core (HT on) 2-node machine with 16-vcpu KVM guests.

In general, I am seeing that undercommit loads benefit from the patches.

base    = 3.11-rc1
patched = base + qlock

+----+-----------+-----------+-----------+------------+
      hackbench (time in sec, lower is better)
+----+-----------+-----------+-----------+------------+
 oc      base       stdev      patched      stdev    %improvement
+----+-----------+-----------+-----------+------------+
0.5x   18.9326     1.6072     20.0686     2.9968      -6.00023
1.0x   34.0585     5.5120     33.2230     1.6119       2.45313
+----+-----------+-----------+-----------+------------+

+----+-----------+-----------+-----------+------------+
      ebizzy (records/sec, higher is better)
+----+-----------+-----------+-----------+------------+
 oc      base       stdev      patched      stdev    %improvement
+----+-----------+-----------+-----------+------------+
0.5x  20499.3750   466.7756  22257.8750   884.8308     8.57831
1.0x  15903.5000   271.7126  17993.5000   682.5095    13.14176
1.5x   1883.2222   166.3714   1742.8889   135.2271    -7.45177
2.5x    829.1250    44.3957    803.6250    78.8034    -3.07553
+----+-----------+-----------+-----------+------------+

+----+-----------+-----------+-----------+------------+
      dbench (throughput in MB/sec, higher is better)
+----+-----------+-----------+-----------+------------+
 oc      base       stdev      patched      stdev    %improvement
+----+-----------+-----------+-----------+------------+
0.5x  11623.5000    34.2764  11667.0250    47.1122     0.37446
1.0x   6945.3675    79.0642   6798.4950   161.9431    -2.11468
1.5x   3950.4367    27.3828   3910.3122    45.4275    -1.01570
2.0x   2588.2063    35.2058   2520.3412    51.7138    -2.62209
+----+-----------+-----------+-----------+------------+

I saw the dbench results improving to 0.3529, -2.9459, 3.2423 and 4.8027 respectively after delaying entry into the slowpath as above.

[...]

I have not yet tested on a bigger machine. I hope that a bigger machine will see significant undercommit improvements.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation 2013-08-01 20:23 ` Raghavendra K T @ 2013-08-01 20:23 ` Raghavendra K T 2013-08-01 20:47 ` Peter Zijlstra 2013-08-01 21:09 ` Waiman Long 2 siblings, 0 replies; 26+ messages in thread From: Raghavendra K T @ 2013-08-01 20:23 UTC (permalink / raw) To: Waiman Long Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt, Andrew Morton, Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds, George Spelvin, Harvey Harrison, Chandramouleeswaran, Aswin, Norton, Scott J

On 08/01/2013 08:07 AM, Waiman Long wrote:

[...]

^ permalink raw reply	[flat|nested] 26+ messages in thread
* Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation 2013-08-01 20:23 ` Raghavendra K T 2013-08-01 20:23 ` Raghavendra K T @ 2013-08-01 20:47 ` Peter Zijlstra 2013-08-01 20:47 ` Peter Zijlstra 2013-08-02 2:54 ` Raghavendra K T 2013-08-01 21:09 ` Waiman Long 2 siblings, 2 replies; 26+ messages in thread From: Peter Zijlstra @ 2013-08-01 20:47 UTC (permalink / raw) To: Raghavendra K T Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton, Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds, George Spelvin, Harvey Harrison <harvey.ha> On Fri, Aug 02, 2013 at 01:53:22AM +0530, Raghavendra K T wrote: You need to learn to trim your replies.. I already stopped reading that paravirt thread because of it. Soon I'll introduce you to my /dev/null mail reader. > On 08/01/2013 08:07 AM, Waiman Long wrote: > >+static __always_inline void queue_spin_lock(struct qspinlock *lock) > >+{ > >+ if (likely(queue_spin_trylock(lock))) > >+ return; > >+ queue_spin_lock_slowpath(lock); > >+} > > quickly falling into slowpath may hurt performance in some cases. no? > > Instead, I tried something like this: > > #define SPIN_THRESHOLD 64 > > static __always_inline void queue_spin_lock(struct qspinlock *lock) > { > unsigned count = SPIN_THRESHOLD; > do { > if (likely(queue_spin_trylock(lock))) > return; > cpu_relax(); > } while (count--); > queue_spin_lock_slowpath(lock); > } > > Though I could see some gains in overcommit, but it hurted undercommit > in some workloads :(. This would break the FIFO nature of the lock. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation 2013-08-01 20:47 ` Peter Zijlstra 2013-08-01 20:47 ` Peter Zijlstra @ 2013-08-02 2:54 ` Raghavendra K T 2013-08-02 2:54 ` Raghavendra K T 1 sibling, 1 reply; 26+ messages in thread From: Raghavendra K T @ 2013-08-02 2:54 UTC (permalink / raw) To: Peter Zijlstra Cc: Waiman Long, Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann, linux-arch, x86, linux-kernel, Steven Rostedt, Andrew Morton, Richard Weinberger, Catalin Marinas, Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita, Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel, Paul E. McKenney, Linus Torvalds, George Spelvin, Harvey Harrison <harvey.ha> On 08/02/2013 02:17 AM, Peter Zijlstra wrote: > On Fri, Aug 02, 2013 at 01:53:22AM +0530, Raghavendra K T wrote: > > You need to learn to trim your replies.. I already stopped reading that > paravirt thread because of it. Soon I'll introduce you to my /dev/null > mail reader. > will be more careful next time. ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-01 20:23 ` Raghavendra K T
  2013-08-01 20:47 ` Peter Zijlstra
@ 2013-08-01 21:09 ` Waiman Long
  2013-08-02  3:00   ` Raghavendra K T
  2 siblings, 2 replies; 26+ messages in thread
From: Waiman Long @ 2013-08-01 21:09 UTC (permalink / raw)
To: Raghavendra K T
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
    linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
    Andrew Morton, Richard Weinberger, Catalin Marinas,
    Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita,
    Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel,
    Paul E. McKenney, Linus Torvalds, George Spelvin

On 08/01/2013 04:23 PM, Raghavendra K T wrote:
> On 08/01/2013 08:07 AM, Waiman Long wrote:
>>
>> +}
>> +/**
>> + * queue_spin_trylock - try to acquire the queue spinlock
>> + * @lock : Pointer to queue spinlock structure
>> + * Return: 1 if lock acquired, 0 if failed
>> + */
>> +static __always_inline int queue_spin_trylock(struct qspinlock *lock)
>> +{
>> +	if (!queue_spin_is_contended(lock) && (xchg(&lock->locked, 1) == 0))
>> +		return 1;
>> +	return 0;
>> +}
>> +
>> +/**
>> + * queue_spin_lock - acquire a queue spinlock
>> + * @lock: Pointer to queue spinlock structure
>> + */
>> +static __always_inline void queue_spin_lock(struct qspinlock *lock)
>> +{
>> +	if (likely(queue_spin_trylock(lock)))
>> +		return;
>> +	queue_spin_lock_slowpath(lock);
>> +}
>
> quickly falling into slowpath may hurt performance in some cases. no?

Failing the trylock means that the process is likely to wait. I do retry
one more time in the slowpath before waiting in the queue.
> Instead, I tried something like this:
>
> #define SPIN_THRESHOLD 64
>
> static __always_inline void queue_spin_lock(struct qspinlock *lock)
> {
> 	unsigned count = SPIN_THRESHOLD;
> 	do {
> 		if (likely(queue_spin_trylock(lock)))
> 			return;
> 		cpu_relax();
> 	} while (count--);
> 	queue_spin_lock_slowpath(lock);
> }
>
> Though I could see some gains in overcommit, but it hurted undercommit
> in some workloads :(.

The gcc 4.4.7 compiler that I used in my test machine has a tendency to
allocate stack space for variables instead of using registers when a
loop is present. So I try to avoid having a loop in the fast path. Also,
the count itself is rather arbitrary. For the first pass, I would like
to keep things simple. We can always enhance it once it is accepted and
merged.

>> +/**
>> + * queue_trylock - try to acquire the lock bit ignoring the qcode in lock
>> + * @lock: Pointer to queue spinlock structure
>> + * Return: 1 if lock acquired, 0 if failed
>> + */
>> +static __always_inline int queue_trylock(struct qspinlock *lock)
>> +{
>> +	if (!ACCESS_ONCE(lock->locked) && (xchg(&lock->locked, 1) == 0))
>> +		return 1;
>> +	return 0;
>> +}
>
> It took long time for me to confirm myself that,
> this is being used when we exhaust all the nodes. But not sure of
> any better name so that it does not confuse with queue_spin_trylock.
> anyway, they are in different files :).

Yes, I know it is confusing. I will change the name to make it more
explicit.

> Result:
> sandybridge 32 cpu/ 16 core (HT on) 2 node machine with 16 vcpu kvm
> guests.
>
> In general, I am seeing undercommit loads are getting benefited by the
> patches.
> base = 3.11-rc1
> patched = base + qlock
>
> hackbench (time in sec, lower is better)
> +------+------------+-----------+------------+-----------+--------------+
> |  oc  |    base    |   stdev   |  patched   |   stdev   | %improvement |
> +------+------------+-----------+------------+-----------+--------------+
> | 0.5x |    18.9326 |    1.6072 |    20.0686 |    2.9968 |     -6.00023 |
> | 1.0x |    34.0585 |    5.5120 |    33.2230 |    1.6119 |      2.45313 |
> +------+------------+-----------+------------+-----------+--------------+
>
> ebizzy (records/sec, higher is better)
> +------+------------+-----------+------------+-----------+--------------+
> |  oc  |    base    |   stdev   |  patched   |   stdev   | %improvement |
> +------+------------+-----------+------------+-----------+--------------+
> | 0.5x | 20499.3750 |  466.7756 | 22257.8750 |  884.8308 |      8.57831 |
> | 1.0x | 15903.5000 |  271.7126 | 17993.5000 |  682.5095 |     13.14176 |
> | 1.5x |  1883.2222 |  166.3714 |  1742.8889 |  135.2271 |     -7.45177 |
> | 2.5x |   829.1250 |   44.3957 |   803.6250 |   78.8034 |     -3.07553 |
> +------+------------+-----------+------------+-----------+--------------+
>
> dbench (throughput in MB/sec, higher is better)
> +------+------------+-----------+------------+-----------+--------------+
> |  oc  |    base    |   stdev   |  patched   |   stdev   | %improvement |
> +------+------------+-----------+------------+-----------+--------------+
> | 0.5x | 11623.5000 |   34.2764 | 11667.0250 |   47.1122 |      0.37446 |
> | 1.0x |  6945.3675 |   79.0642 |  6798.4950 |  161.9431 |     -2.11468 |
> | 1.5x |  3950.4367 |   27.3828 |  3910.3122 |   45.4275 |     -1.01570 |
> | 2.0x |  2588.2063 |   35.2058 |  2520.3412 |   51.7138 |     -2.62209 |
> +------+------------+-----------+------------+-----------+--------------+
>
> I saw dbench results improving to 0.3529, -2.9459, 3.2423, 4.8027
> respectively after delaying entering to slowpath above.
> [...]
>
> I have not yet tested on bigger machine. I hope that bigger machine will
> see significant undercommit improvements.

Thanks for running the test. I am a bit confused about the terminology.
What exactly do undercommit and overcommit mean? Regards, Longman ^ permalink raw reply [flat|nested] 26+ messages in thread
* Re: [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation
  2013-08-01 21:09 ` Waiman Long
@ 2013-08-02  3:00 ` Raghavendra K T
  1 sibling, 1 reply; 26+ messages in thread
From: Raghavendra K T @ 2013-08-02 3:00 UTC (permalink / raw)
To: Waiman Long
Cc: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann,
    linux-arch, x86, linux-kernel, Peter Zijlstra, Steven Rostedt,
    Andrew Morton, Richard Weinberger, Catalin Marinas,
    Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita,
    Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel,
    Paul E. McKenney, Linus Torvalds, George Spelvin

On 08/02/2013 02:39 AM, Waiman Long wrote:
> On 08/01/2013 04:23 PM, Raghavendra K T wrote:
>> On 08/01/2013 08:07 AM, Waiman Long wrote:
[..]
>>
>> Though I could see some gains in overcommit, but it hurted undercommit
>> in some workloads :(.
>
> The gcc 4.4.7 compiler that I used in my test machine has the tendency
> of allocating stack space for variables instead of using registers when
> a loop is present. So I try to avoid having loop in the fast path. Also
> the count itself is rather arbitrary. For the first pass, I would like
> to make thing simple. We can always enhance it once it is accepted and
> merged.

Yes, agreed.

>>
>> I have not yet tested on bigger machine. I hope that bigger machine will
>> see significant undercommit improvements.
>
> Thank for running the test. I am a bit confused about the terminology.
> What exactly do undercommit and overcommit mean?

By undercommit I meant total #vcpus < #pcpus in a virtualized
environment, so overcommit should not be an issue on bare metal.

^ permalink raw reply	[flat|nested] 26+ messages in thread
* [PATCH RFC 2/2] qspinlock x86: Enable x86 to use queue spinlock
  [not found] <1375324631-32868-1-git-send-email-Waiman.Long@hp.com>
  2013-08-01  2:37 ` [PATCH RFC 1/2] qspinlock: Introducing a 4-byte queue spinlock implementation Waiman Long
@ 2013-08-01  2:37 ` Waiman Long
  1 sibling, 1 reply; 26+ messages in thread
From: Waiman Long @ 2013-08-01 2:37 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, H. Peter Anvin, Arnd Bergmann
Cc: Waiman Long, linux-arch, x86, linux-kernel, Peter Zijlstra,
    Steven Rostedt, Andrew Morton, Richard Weinberger, Catalin Marinas,
    Greg Kroah-Hartman, Matt Fleming, Herbert Xu, Akinobu Mita,
    Rusty Russell, Michel Lespinasse, Andi Kleen, Rik van Riel,
    Paul E. McKenney, Linus Torvalds, Raghavendra K T, George Spelvin,
    Harvey Harrison, Chandramouleeswaran, As

This patch makes the necessary changes at the x86 architecture-specific
layer to add the CONFIG_QUEUE_SPINLOCK kernel option, which can be used
to replace the ticket spinlock with the queue spinlock. Turning on the
ARCH_QUEUE_SPINLOCK config option does not mean that the ticket spinlock
will be replaced; it only means that the architecture supports the
replacement. The actual replacement happens only when the QUEUE_SPINLOCK
config option is also set.
Signed-off-by: Waiman Long <Waiman.Long@hp.com>
---
 arch/x86/Kconfig                      |    3 +++
 arch/x86/include/asm/spinlock.h       |    2 ++
 arch/x86/include/asm/spinlock_types.h |    4 ++++
 arch/x86/kernel/paravirt-spinlocks.c  |   10 ++++++++++
 4 files changed, 19 insertions(+), 0 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index b32ebf9..8531e47 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2344,6 +2344,9 @@ config X86_DMA_REMAP
 	bool
 	depends on STA2X11
 
+config ARCH_QUEUE_SPINLOCK
+	def_bool y
+
 source "net/Kconfig"
 
 source "drivers/Kconfig"
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 33692ea..0ea805d 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -34,6 +34,7 @@
 # define UNLOCK_LOCK_PREFIX
 #endif
 
+#ifndef CONFIG_QUEUE_SPINLOCK
 /*
  * Ticket locks are conceptually two parts, one indicating the current head of
  * the queue, and the other indicating the current tail. The lock is acquired
@@ -130,6 +131,7 @@ static __always_inline void arch_spin_lock_flags(arch_spinlock_t *lock,
 }
 
 #endif	/* CONFIG_PARAVIRT_SPINLOCKS */
+#endif	/* !CONFIG_QUEUE_SPINLOCK */
 
 static inline void arch_spin_unlock_wait(arch_spinlock_t *lock)
 {
diff --git a/arch/x86/include/asm/spinlock_types.h b/arch/x86/include/asm/spinlock_types.h
index ad0ad07..8a7721e 100644
--- a/arch/x86/include/asm/spinlock_types.h
+++ b/arch/x86/include/asm/spinlock_types.h
@@ -7,6 +7,9 @@
 
 #include <linux/types.h>
 
+#ifdef CONFIG_QUEUE_SPINLOCK
+# include <asm-generic/qspinlock.h>
+#else
 #if (CONFIG_NR_CPUS < 256)
 typedef u8 __ticket_t;
 typedef u16 __ticketpair_t;
@@ -27,6 +30,7 @@ typedef struct arch_spinlock {
 } arch_spinlock_t;
 
 #define __ARCH_SPIN_LOCK_UNLOCKED	{ { 0 } }
+#endif
 
 #include <asm/rwlock.h>
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 676b8c7..75569cb 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -15,6 +15,15 @@ default_spin_lock_flags(arch_spinlock_t *lock, unsigned long flags)
 
 struct pv_lock_ops pv_lock_ops = {
 #ifdef CONFIG_SMP
+# ifdef CONFIG_QUEUE_SPINLOCK
+	.spin_is_locked = queue_spin_is_locked,
+	.spin_is_contended = queue_spin_is_contended,
+
+	.spin_lock = queue_spin_lock,
+	.spin_lock_flags = default_spin_lock_flags,
+	.spin_trylock = queue_spin_trylock,
+	.spin_unlock = queue_spin_unlock,
+# else
 	.spin_is_locked = __ticket_spin_is_locked,
 	.spin_is_contended = __ticket_spin_is_contended,
@@ -22,6 +31,7 @@ struct pv_lock_ops pv_lock_ops = {
 	.spin_lock_flags = default_spin_lock_flags,
 	.spin_trylock = __ticket_spin_trylock,
 	.spin_unlock = __ticket_spin_unlock,
+# endif
 #endif
 };
 
 EXPORT_SYMBOL(pv_lock_ops);
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 26+ messages in thread