* [RFC PATCH-tip v4 00/10] locking/rwsem: Enable reader optimistic spinning
@ 2016-08-18 21:11 Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier Waiman Long
` (9 more replies)
0 siblings, 10 replies; 33+ messages in thread
From: Waiman Long @ 2016-08-18 21:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch, Waiman Long
v3->v4:
- Rebased to the latest tip tree due to changes to rwsem-xadd.c.
- Updated the OSQ patch to fix a race condition.
v2->v3:
- Used smp_acquire__after_ctrl_dep() to provide acquire barrier.
- Added the following new patches:
1) make rwsem_spin_on_owner() return a tri-state value.
2) reactivate reader spinning when there is a large number of
favorable writer-on-writer spinnings.
3) move all the rwsem macros in arch-specific rwsem.h files
into a common asm-generic/rwsem_types.h file.
4) add a boot parameter to specify the reader spinning threshold.
- Updated some of the patches as suggested by PeterZ and adjusted
some of the reader spinning parameters.
v1->v2:
- Fixed a 0day build error.
- Added a new patch 1 to make osq_lock() a proper acquire memory
barrier.
- Replaced the explicit enabling of reader spinning by an autotuning
mechanism that disables reader spinning for those rwsems that may
not benefit from reader spinning.
- Removed the last xfs patch as it is no longer necessary.
This patchset enables more aggressive optimistic spinning on
both readers and writers waiting on a writer- or reader-owned
lock. Spinning on a writer is done by looking at the on_cpu flag of the
lock owner. Spinning on readers, on the other hand, is count-based as
there is no easy way to figure out if all the readers are running. The
spinner will stop spinning once the count goes to 0. Because of that,
spinning on readers may hurt performance in some cases.
An autotuning mechanism is used to determine if a rwsem can benefit
from reader optimistic spinning. It will maintain reader spinning as
long as no less than 80% of the spins are successful.
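A simplified sketch of the autotuning rule described above (illustrative
only: the helper name is made up, the decrement amount follows the
description here, and the actual adjustment is done inline in
rwsem_optimistic_spin() under the osq lock in patch 4):

    /* Illustrative sketch: adjust the per-rwsem reader-spin enable count. */
    static void rspin_autotune(struct rw_semaphore *sem, bool lock_acquired)
    {
            if (lock_acquired) {
                    /* Successful spin: +1, capped at a maximum. */
                    if (sem->rspin_enabled < RWSEM_RSPIN_ENABLED_MAX)
                            sem->rspin_enabled++;
            } else if (sem->rspin_enabled) {
                    /* Unsuccessful spin: -4, stopping at 0. */
                    sem->rspin_enabled = (sem->rspin_enabled > 4)
                                       ? sem->rspin_enabled - 4 : 0;
            }
            /* rspin_enabled == 0 means reader spinning is disabled. */
    }

With +1 per success and -4 per failure, the count holds steady only when
at least 4 out of every 5 spins succeed, which is where the 80% figure
comes from.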
Patch 1 updates the osq_lock() function to make it a proper acquire
memory barrier.
Patch 2 reduces the length of the blocking window after a read locking
attempt where writer lock stealing is disabled because of the active
read lock. It can improve rwsem performance for contended locks. It is
independent of the rest of the patchset.
Patch 3 modifies rwsem_spin_on_owner() to return a tri-state value
that can be used in a later patch.
Patch 4 puts in place the autotuning mechanism to check if reader
optimistic spinning should be used or not.
Patch 5 moves down the rwsem_down_read_failed() function for later
patches.
Patch 6 moves the macro definitions in various arch-specific rwsem.h
header files into a common asm-generic/rwsem_types.h file.
Patch 7 changes RWSEM_WAITING_BIAS to simplify reader trylock code.
Patch 8 enables readers to do optimistic spinning.
Patch 9 allows reactivation of reader spinning when a lot of
writer-on-writer spins are successful.
Patch 10 adds a new boot parameter to change the reader spinning
threshold which can be system specific.
Waiman Long (10):
locking/osq: Make lock/unlock proper acquire/release barrier
locking/rwsem: Stop active read lock ASAP
locking/rwsem: Make rwsem_spin_on_owner() return a tri-state value
locking/rwsem: Enable count-based spinning on reader
locking/rwsem: move down rwsem_down_read_failed function
locking/rwsem: Move common rwsem macros to asm-generic/rwsem_types.h
locking/rwsem: Change RWSEM_WAITING_BIAS for better disambiguation
locking/rwsem: Enable spinning readers
locking/rwsem: Enable reactivation of reader spinning
locking/rwsem: Add a boot parameter to reader spinning threshold
Documentation/kernel-parameters.txt | 3 +
arch/alpha/include/asm/rwsem.h | 11 +-
arch/ia64/include/asm/rwsem.h | 9 +-
arch/s390/include/asm/rwsem.h | 9 +-
arch/x86/include/asm/rwsem.h | 22 +---
include/asm-generic/rwsem.h | 20 +--
include/asm-generic/rwsem_types.h | 28 ++++
include/linux/rwsem.h | 23 +++-
kernel/locking/osq_lock.c | 24 ++-
kernel/locking/rwsem-xadd.c | 296 ++++++++++++++++++++++++++---------
10 files changed, 308 insertions(+), 137 deletions(-)
create mode 100644 include/asm-generic/rwsem_types.h
^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
2016-08-18 21:11 [RFC PATCH-tip v4 00/10] locking/rwsem: Enable reader optimistic spinning Waiman Long
@ 2016-08-18 21:11 ` Waiman Long
2016-10-04 19:06 ` Davidlohr Bueso
2016-08-18 21:11 ` [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP Waiman Long
` (8 subsequent siblings)
9 siblings, 1 reply; 33+ messages in thread
From: Waiman Long @ 2016-08-18 21:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch, Waiman Long
The osq_lock() and osq_unlock() functions may not provide the necessary
acquire and release barrier in some cases. This patch makes sure
that the proper barriers are provided when osq_lock() is successful
or when osq_unlock() is called.
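To illustrate why the acquire/release pairing matters (a hand-written
sketch, not taken from the patch; the helper and shared_data are made up
for the example), a successful osq_lock() must keep accesses inside the
protected region from being reordered before the lock is observed as
taken, and osq_unlock() must keep them from leaking past the unlock:

    /*
     * Illustrative only: with osq_lock() acting as an ACQUIRE and
     * osq_unlock() as a RELEASE, the update to *shared_data cannot be
     * reordered outside the protected region on weakly ordered CPUs.
     */
    static bool osq_protected_inc(struct optimistic_spin_queue *lock,
                                  int *shared_data)
    {
            if (!osq_lock(lock))            /* ACQUIRE on success */
                    return false;

            (*shared_data)++;               /* must stay between lock and unlock */

            osq_unlock(lock);               /* RELEASE */
            return true;
    }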
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
kernel/locking/osq_lock.c | 24 ++++++++++++++++++------
1 files changed, 18 insertions(+), 6 deletions(-)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..3da0b97 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -124,6 +124,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
cpu_relax_lowlatency();
}
+ /*
+ * Add an acquire memory barrier for pairing with the release barrier
+ * in unlock.
+ */
+ smp_acquire__after_ctrl_dep();
return true;
unqueue:
@@ -198,13 +203,20 @@ void osq_unlock(struct optimistic_spin_queue *lock)
* Second most likely case.
*/
node = this_cpu_ptr(&osq_node);
- next = xchg(&node->next, NULL);
- if (next) {
- WRITE_ONCE(next->locked, 1);
+ next = xchg_relaxed(&node->next, NULL);
+ if (next)
+ goto unlock;
+
+ next = osq_wait_next(lock, node, NULL);
+ if (unlikely(!next)) {
+ /*
+ * In the unlikely event that the OSQ is empty, we need to
+ * provide a proper release barrier.
+ */
+ smp_mb();
return;
}
- next = osq_wait_next(lock, node, NULL);
- if (next)
- WRITE_ONCE(next->locked, 1);
+unlock:
+ smp_store_release(&next->locked, 1);
}
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP
2016-08-18 21:11 [RFC PATCH-tip v4 00/10] locking/rwsem: Enable reader optimistic spinning Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier Waiman Long
@ 2016-08-18 21:11 ` Waiman Long
2016-10-06 18:17 ` Davidlohr Bueso
2016-08-18 21:11 ` [RFC PATCH-tip v4 03/10] locking/rwsem: Make rwsem_spin_on_owner() return a tri-state value Waiman Long
` (7 subsequent siblings)
9 siblings, 1 reply; 33+ messages in thread
From: Waiman Long @ 2016-08-18 21:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch, Waiman Long
Currently, when down_read() fails, the active read locking isn't undone
until the rwsem_down_read_failed() function grabs the wait_lock. If the
wait_lock is contended, it may take a while to get the lock. During
that period, writer lock stealing will be disabled because of the
active read lock.
This patch will release the active read lock ASAP so that writer lock
stealing can happen sooner. The only downside is when the reader is
the first one in the wait queue as it has to issue another atomic
operation to update the count.
On a 4-socket Haswell machine running a 4.7-rc1 tip-based kernel,
fio multithreaded randrw and randwrite tests were run on the same
file on an XFS partition on top of an NVDIMM with DAX. The
aggregated bandwidths before and after the patch were as follows:
Test BW before patch BW after patch % change
---- --------------- -------------- --------
randrw 1210 MB/s 1352 MB/s +12%
randwrite 1622 MB/s 1710 MB/s +5.4%
The write-only microbench also showed improvement because some read
locking was done by the XFS code.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
kernel/locking/rwsem-xadd.c | 21 +++++++++++++++------
1 files changed, 15 insertions(+), 6 deletions(-)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..9309e72 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -222,21 +222,31 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
__visible
struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
{
- long count, adjustment = -RWSEM_ACTIVE_READ_BIAS;
+ long count, adjustment = 0;
struct rwsem_waiter waiter;
struct task_struct *tsk = current;
WAKE_Q(wake_q);
+ /*
+ * Undo read bias from down_read operation to stop active locking.
+ * Doing that after taking the wait_lock may block writer lock
+ * stealing for too long impacting system performance.
+ */
+ atomic_long_add(-RWSEM_ACTIVE_READ_BIAS, &sem->count);
+
waiter.task = tsk;
waiter.type = RWSEM_WAITING_FOR_READ;
raw_spin_lock_irq(&sem->wait_lock);
if (list_empty(&sem->wait_list))
- adjustment += RWSEM_WAITING_BIAS;
+ adjustment = RWSEM_WAITING_BIAS;
list_add_tail(&waiter.list, &sem->wait_list);
- /* we're now waiting on the lock, but no longer actively locking */
- count = atomic_long_add_return(adjustment, &sem->count);
+ /* we're now waiting on the lock */
+ if (adjustment)
+ count = atomic_long_add_return(adjustment, &sem->count);
+ else
+ count = atomic_long_read(&sem->count);
/*
* If there are no active locks, wake the front queued process(es).
@@ -245,8 +255,7 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
* wake our own waiter to join the existing active readers !
*/
if (count == RWSEM_WAITING_BIAS ||
- (count > RWSEM_WAITING_BIAS &&
- adjustment != -RWSEM_ACTIVE_READ_BIAS))
+ (count > RWSEM_WAITING_BIAS && adjustment))
__rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
raw_spin_unlock_irq(&sem->wait_lock);
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH-tip v4 03/10] locking/rwsem: Make rwsem_spin_on_owner() return a tri-state value
2016-08-18 21:11 [RFC PATCH-tip v4 00/10] locking/rwsem: Enable reader optimistic spinning Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP Waiman Long
@ 2016-08-18 21:11 ` Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 04/10] locking/rwsem: Enable count-based spinning on reader Waiman Long
` (6 subsequent siblings)
9 siblings, 0 replies; 33+ messages in thread
From: Waiman Long @ 2016-08-18 21:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch, Waiman Long
This patch modifies rwsem_spin_on_owner() to return a tri-state value
to better reflect the state of the lock holder, which enables us to make a
better decision about what to do next.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
kernel/locking/rwsem-xadd.c | 14 +++++++++-----
1 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 9309e72..31cb286 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -352,9 +352,13 @@ done:
}
/*
- * Return true only if we can still spin on the owner field of the rwsem.
+ * Return the following three values depending on the lock owner state.
+ * 1 when owner has changed and no reader is detected yet.
+ * 0 when owner has changed and/or owner is a reader.
+ * -1 when optimistic spinning has to stop because either the owner stops
+ * running or its timeslice has been used up.
*/
-static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
+static noinline int rwsem_spin_on_owner(struct rw_semaphore *sem)
{
struct task_struct *owner = READ_ONCE(sem->owner);
@@ -374,7 +378,7 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
/* abort spinning when need_resched or owner is not running */
if (!owner->on_cpu || need_resched()) {
rcu_read_unlock();
- return false;
+ return -1;
}
cpu_relax_lowlatency();
@@ -385,7 +389,7 @@ out:
* If there is a new owner or the owner is not set, we continue
* spinning.
*/
- return !rwsem_owner_is_reader(READ_ONCE(sem->owner));
+ return rwsem_owner_is_reader(READ_ONCE(sem->owner)) ? 0 : 1;
}
static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
@@ -408,7 +412,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
* 2) readers own the lock as we can't determine if they are
* actively running or not.
*/
- while (rwsem_spin_on_owner(sem)) {
+ while (rwsem_spin_on_owner(sem) > 0) {
/*
* Try to acquire the lock
*/
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH-tip v4 04/10] locking/rwsem: Enable count-based spinning on reader
2016-08-18 21:11 [RFC PATCH-tip v4 00/10] locking/rwsem: Enable reader optimistic spinning Waiman Long
` (2 preceding siblings ...)
2016-08-18 21:11 ` [RFC PATCH-tip v4 03/10] locking/rwsem: Make rwsem_spin_on_owner() return a tri-state value Waiman Long
@ 2016-08-18 21:11 ` Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 05/10] locking/rwsem: move down rwsem_down_read_failed function Waiman Long
` (5 subsequent siblings)
9 siblings, 0 replies; 33+ messages in thread
From: Waiman Long @ 2016-08-18 21:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch, Waiman Long
When the rwsem is owned by readers, writers stop optimistic spinning
simply because there is no easy way to figure out if all the readers
are actively running or not. However, there are scenarios where
the readers are unlikely to sleep and optimistic spinning can help
performance.
This patch provides an autotuning mechanism to find out if a rwsem
can benefit from count-based reader optimistic spinning. A count
(rspin_enabled) in the rwsem data structure is used to track if
optimistic spinning should be enabled. Reader spinning is enabled
by default. Each successful spin (with lock acquisition) will
increment the count by 1 and each unsuccessful spin will decrement
it by 4. When the count reaches 0, reader spinning is disabled.
Modification of that count is protected by the osq lock. Therefore,
reader spinning will be maintained as long as at least 80% of the
spins are successful.
Both the spinning threshold and the default value for rspin_enabled
can be overridden by architecture specific rwsem.h header file.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
include/linux/rwsem.h | 19 +++++++++++-
kernel/locking/rwsem-xadd.c | 66 ++++++++++++++++++++++++++++++++++++++----
2 files changed, 77 insertions(+), 8 deletions(-)
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index dd1d142..8978f87 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -32,6 +32,8 @@ struct rw_semaphore {
raw_spinlock_t wait_lock;
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
struct optimistic_spin_queue osq; /* spinner MCS lock */
+ int rspin_enabled; /* protected by osq lock */
+
/*
* Write owner. Used as a speculative check to see
* if the owner is running on the cpu.
@@ -69,8 +71,23 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
# define __RWSEM_DEP_MAP_INIT(lockname)
#endif
+/*
+ * Each successful reader spin will increment the rspin_enabled by 1.
+ * Each unsuccessful spin, on the other hand, will decrement it by 2.
+ * Reader spinning will be permanently disabled when it reaches 0.
+ */
+#ifndef RWSEM_RSPIN_ENABLED_DEFAULT
+# define RWSEM_RSPIN_ENABLED_DEFAULT 40
+#endif
+#define RWSEM_RSPIN_ENABLED_MAX 1024
+
+#ifndef RWSEM_RSPIN_THRESHOLD
+# define RWSEM_RSPIN_THRESHOLD (1 << 12)
+#endif
+
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
-#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL
+#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL, \
+ .rspin_enabled = RWSEM_RSPIN_ENABLED_DEFAULT
#else
#define __RWSEM_OPT_INIT(lockname)
#endif
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 31cb286..4654624 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -85,6 +85,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
INIT_LIST_HEAD(&sem->wait_list);
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
sem->owner = NULL;
+ sem->rspin_enabled = RWSEM_RSPIN_ENABLED_DEFAULT;
osq_lock_init(&sem->osq);
#endif
}
@@ -339,9 +340,10 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
owner = READ_ONCE(sem->owner);
if (!rwsem_owner_is_writer(owner)) {
/*
- * Don't spin if the rwsem is readers owned.
+ * Don't spin if the rwsem is readers owned and the
+ * reader spinning threshold isn't set.
*/
- ret = !rwsem_owner_is_reader(owner);
+ ret = !rwsem_owner_is_reader(owner) || sem->rspin_enabled;
goto done;
}
@@ -395,6 +397,8 @@ out:
static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
{
bool taken = false;
+ int owner_state; /* Lock owner state */
+ int rspin_cnt; /* Count for reader spinning */
preempt_disable();
@@ -405,14 +409,16 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
if (!osq_lock(&sem->osq))
goto done;
+ rspin_cnt = sem->rspin_enabled ? RWSEM_RSPIN_THRESHOLD : 0;
+
/*
* Optimistically spin on the owner field and attempt to acquire the
* lock whenever the owner changes. Spinning will be stopped when:
- * 1) the owning writer isn't running; or
- * 2) readers own the lock as we can't determine if they are
- * actively running or not.
+ * 1) the owning writer isn't running,
+ * 2) readers own the lock and reader spinning count has reached 0; or
+ * 3) its timeslice has been used up.
*/
- while (rwsem_spin_on_owner(sem) > 0) {
+ while ((owner_state = rwsem_spin_on_owner(sem)) >= 0) {
/*
* Try to acquire the lock
*/
@@ -422,12 +428,24 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
}
/*
+ * We only decrement the rspin_cnt when the lock is owned
+ * by readers (owner_state == 0), in which case
+ * rwsem_spin_on_owner() will essentially be a no-op
+ * and we will be spinning in this main loop.
+ */
+ if (owner_state == 0) {
+ if (!rspin_cnt)
+ break;
+ rspin_cnt--;
+ }
+
+ /*
* When there's no owner, we might have preempted between the
* owner acquiring the lock and setting the owner field. If
* we're an RT task that will live-lock because we won't let
* the owner complete.
*/
- if (!sem->owner && (need_resched() || rt_task(current)))
+ if (!sem->owner && rt_task(current))
break;
/*
@@ -438,6 +456,28 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
*/
cpu_relax_lowlatency();
}
+ /*
+ * Check the success or failure of writer spinning on reader so as
+ * to adjust the rspin_enabled count accordingly.
+ */
+ if (rwsem_owner_is_reader(sem->owner)) {
+ /*
+ * Update rspin_enabled for reader spinning.
+ *
+ * Right now, we need more than 2/3 successful spins to
+ * maintain reader spinning. We will get rid of it if we don't
+ * have enough successful spins. The decrement amount is kind
+ * of arbitrary and can be adjusted if necessary.
+ */
+ if (taken && (sem->rspin_enabled < RWSEM_RSPIN_ENABLED_MAX)) {
+ sem->rspin_enabled++;
+ } else if (!taken) {
+ if (sem->rspin_enabled > 2)
+ sem->rspin_enabled -= 2;
+ else
+ sem->rspin_enabled = 0;
+ }
+ }
osq_unlock(&sem->osq);
done:
preempt_enable();
@@ -452,6 +492,13 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
return osq_is_locked(&sem->osq);
}
+/*
+ * Return true if reader optimistic spinning is enabled
+ */
+static inline bool reader_spinning_enabled(struct rw_semaphore *sem)
+{
+ return sem->rspin_enabled;
+}
#else
static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
{
@@ -462,6 +509,11 @@ static inline bool rwsem_has_spinner(struct rw_semaphore *sem)
{
return false;
}
+
+static inline bool reader_spinning_enabled(struct rw_semaphore *sem)
+{
+ return false;
+}
#endif
/*
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH-tip v4 05/10] locking/rwsem: move down rwsem_down_read_failed function
2016-08-18 21:11 [RFC PATCH-tip v4 00/10] locking/rwsem: Enable reader optimistic spinning Waiman Long
` (3 preceding siblings ...)
2016-08-18 21:11 ` [RFC PATCH-tip v4 04/10] locking/rwsem: Enable count-based spinning on reader Waiman Long
@ 2016-08-18 21:11 ` Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 06/10] locking/rwsem: Move common rwsem macros to asm-generic/rwsem_types.h Waiman Long
` (4 subsequent siblings)
9 siblings, 0 replies; 33+ messages in thread
From: Waiman Long @ 2016-08-18 21:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch, Waiman Long
Move the rwsem_down_read_failed() function down to below the
optimistic spinning section before enabling optimistic spinning for
the readers. This is because the rwsem_down_read_failed() function will
call rwsem_optimistic_spin() in a later patch.
There is no change in code.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
kernel/locking/rwsem-xadd.c | 116 +++++++++++++++++++++---------------------
1 files changed, 58 insertions(+), 58 deletions(-)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 4654624..6b5bd6e 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -218,64 +218,6 @@ static void __rwsem_mark_wake(struct rw_semaphore *sem,
}
/*
- * Wait for the read lock to be granted
- */
-__visible
-struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
-{
- long count, adjustment = 0;
- struct rwsem_waiter waiter;
- struct task_struct *tsk = current;
- WAKE_Q(wake_q);
-
- /*
- * Undo read bias from down_read operation to stop active locking.
- * Doing that after taking the wait_lock may block writer lock
- * stealing for too long impacting system performance.
- */
- atomic_long_add(-RWSEM_ACTIVE_READ_BIAS, &sem->count);
-
- waiter.task = tsk;
- waiter.type = RWSEM_WAITING_FOR_READ;
-
- raw_spin_lock_irq(&sem->wait_lock);
- if (list_empty(&sem->wait_list))
- adjustment = RWSEM_WAITING_BIAS;
- list_add_tail(&waiter.list, &sem->wait_list);
-
- /* we're now waiting on the lock */
- if (adjustment)
- count = atomic_long_add_return(adjustment, &sem->count);
- else
- count = atomic_long_read(&sem->count);
-
- /*
- * If there are no active locks, wake the front queued process(es).
- *
- * If there are no writers and we are first in the queue,
- * wake our own waiter to join the existing active readers !
- */
- if (count == RWSEM_WAITING_BIAS ||
- (count > RWSEM_WAITING_BIAS && adjustment))
- __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
-
- raw_spin_unlock_irq(&sem->wait_lock);
- wake_up_q(&wake_q);
-
- /* wait to be given the lock */
- while (true) {
- set_task_state(tsk, TASK_UNINTERRUPTIBLE);
- if (!waiter.task)
- break;
- schedule();
- }
-
- __set_task_state(tsk, TASK_RUNNING);
- return sem;
-}
-EXPORT_SYMBOL(rwsem_down_read_failed);
-
-/*
* This function must be called with the sem->wait_lock held to prevent
* race conditions between checking the rwsem wait list and setting the
* sem->count accordingly.
@@ -517,6 +459,64 @@ static inline bool reader_spinning_enabled(struct rw_semaphore *sem)
#endif
/*
+ * Wait for the read lock to be granted
+ */
+__visible
+struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
+{
+ long count, adjustment = 0;
+ struct rwsem_waiter waiter;
+ struct task_struct *tsk = current;
+ WAKE_Q(wake_q);
+
+ /*
+ * Undo read bias from down_read operation to stop active locking.
+ * Doing that after taking the wait_lock may block writer lock
+ * stealing for too long impacting system performance.
+ */
+ atomic_long_add(-RWSEM_ACTIVE_READ_BIAS, &sem->count);
+
+ waiter.task = tsk;
+ waiter.type = RWSEM_WAITING_FOR_READ;
+
+ raw_spin_lock_irq(&sem->wait_lock);
+ if (list_empty(&sem->wait_list))
+ adjustment = RWSEM_WAITING_BIAS;
+ list_add_tail(&waiter.list, &sem->wait_list);
+
+ /* we're now waiting on the lock */
+ if (adjustment)
+ count = atomic_long_add_return(adjustment, &sem->count);
+ else
+ count = atomic_long_read(&sem->count);
+
+ /*
+ * If there are no active locks, wake the front queued process(es).
+ *
+ * If there are no writers and we are first in the queue,
+ * wake our own waiter to join the existing active readers !
+ */
+ if (count == RWSEM_WAITING_BIAS ||
+ (count > RWSEM_WAITING_BIAS && adjustment))
+ __rwsem_mark_wake(sem, RWSEM_WAKE_ANY, &wake_q);
+
+ raw_spin_unlock_irq(&sem->wait_lock);
+ wake_up_q(&wake_q);
+
+ /* wait to be given the lock */
+ while (true) {
+ set_task_state(tsk, TASK_UNINTERRUPTIBLE);
+ if (!waiter.task)
+ break;
+ schedule();
+ }
+
+ __set_task_state(tsk, TASK_RUNNING);
+ return sem;
+}
+EXPORT_SYMBOL(rwsem_down_read_failed);
+
+/*
* Wait until we successfully acquire the write lock
*/
static inline struct rw_semaphore *
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH-tip v4 06/10] locking/rwsem: Move common rwsem macros to asm-generic/rwsem_types.h
2016-08-18 21:11 [RFC PATCH-tip v4 00/10] locking/rwsem: Enable reader optimistic spinning Waiman Long
` (4 preceding siblings ...)
2016-08-18 21:11 ` [RFC PATCH-tip v4 05/10] locking/rwsem: move down rwsem_down_read_failed function Waiman Long
@ 2016-08-18 21:11 ` Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 07/10] locking/rwsem: Change RWSEM_WAITING_BIAS for better disambiguation Waiman Long
` (3 subsequent siblings)
9 siblings, 0 replies; 33+ messages in thread
From: Waiman Long @ 2016-08-18 21:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch, Waiman Long
Almost all the macro definitions in the various architecture specific
rwsem.h header files are essentially the same. This patch moves all
of them into a common header asm-generic/rwsem_types.h to eliminate
the duplication.
Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
arch/alpha/include/asm/rwsem.h | 8 +-------
arch/ia64/include/asm/rwsem.h | 7 ++-----
arch/s390/include/asm/rwsem.h | 7 +------
arch/x86/include/asm/rwsem.h | 19 +------------------
include/asm-generic/rwsem.h | 16 +---------------
include/asm-generic/rwsem_types.h | 26 ++++++++++++++++++++++++++
6 files changed, 32 insertions(+), 51 deletions(-)
create mode 100644 include/asm-generic/rwsem_types.h
diff --git a/arch/alpha/include/asm/rwsem.h b/arch/alpha/include/asm/rwsem.h
index 77873d0..f99e39a 100644
--- a/arch/alpha/include/asm/rwsem.h
+++ b/arch/alpha/include/asm/rwsem.h
@@ -13,13 +13,7 @@
#ifdef __KERNEL__
#include <linux/compiler.h>
-
-#define RWSEM_UNLOCKED_VALUE 0x0000000000000000L
-#define RWSEM_ACTIVE_BIAS 0x0000000000000001L
-#define RWSEM_ACTIVE_MASK 0x00000000ffffffffL
-#define RWSEM_WAITING_BIAS (-0x0000000100000000L)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#include <asm-generic/rwsem_types.h>
static inline void __down_read(struct rw_semaphore *sem)
{
diff --git a/arch/ia64/include/asm/rwsem.h b/arch/ia64/include/asm/rwsem.h
index 8fa98dd..21a9066 100644
--- a/arch/ia64/include/asm/rwsem.h
+++ b/arch/ia64/include/asm/rwsem.h
@@ -26,13 +26,10 @@
#endif
#include <asm/intrinsics.h>
+#include <asm-generic/rwsem_types.h>
+#undef RWSEM_UNLOCKED_VALUE
#define RWSEM_UNLOCKED_VALUE __IA64_UL_CONST(0x0000000000000000)
-#define RWSEM_ACTIVE_BIAS (1L)
-#define RWSEM_ACTIVE_MASK (0xffffffffL)
-#define RWSEM_WAITING_BIAS (-0x100000000L)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
/*
* lock for reading
diff --git a/arch/s390/include/asm/rwsem.h b/arch/s390/include/asm/rwsem.h
index 597e7e9..13dedc8 100644
--- a/arch/s390/include/asm/rwsem.h
+++ b/arch/s390/include/asm/rwsem.h
@@ -39,12 +39,7 @@
#error "please don't include asm/rwsem.h directly, use linux/rwsem.h instead"
#endif
-#define RWSEM_UNLOCKED_VALUE 0x0000000000000000L
-#define RWSEM_ACTIVE_BIAS 0x0000000000000001L
-#define RWSEM_ACTIVE_MASK 0x00000000ffffffffL
-#define RWSEM_WAITING_BIAS (-0x0000000100000000L)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#include <asm-generic/rwsem_types.h>
/*
* lock for reading
diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index 8dbc762..da84726 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -38,24 +38,7 @@
#ifdef __KERNEL__
#include <asm/asm.h>
-
-/*
- * The bias values and the counter type limits the number of
- * potential readers/writers to 32767 for 32 bits and 2147483647
- * for 64 bits.
- */
-
-#ifdef CONFIG_X86_64
-# define RWSEM_ACTIVE_MASK 0xffffffffL
-#else
-# define RWSEM_ACTIVE_MASK 0x0000ffffL
-#endif
-
-#define RWSEM_UNLOCKED_VALUE 0x00000000L
-#define RWSEM_ACTIVE_BIAS 0x00000001L
-#define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#include <asm-generic/rwsem_types.h>
/*
* lock for reading
diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
index 5be122e..3cb8d98 100644
--- a/include/asm-generic/rwsem.h
+++ b/include/asm-generic/rwsem.h
@@ -12,21 +12,7 @@
* Adapted largely from include/asm-i386/rwsem.h
* by Paul Mackerras <paulus@samba.org>.
*/
-
-/*
- * the semaphore definition
- */
-#ifdef CONFIG_64BIT
-# define RWSEM_ACTIVE_MASK 0xffffffffL
-#else
-# define RWSEM_ACTIVE_MASK 0x0000ffffL
-#endif
-
-#define RWSEM_UNLOCKED_VALUE 0x00000000L
-#define RWSEM_ACTIVE_BIAS 0x00000001L
-#define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK-1)
-#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#include <asm-generic/rwsem_types.h>
/*
* lock for reading
diff --git a/include/asm-generic/rwsem_types.h b/include/asm-generic/rwsem_types.h
new file mode 100644
index 0000000..093ef6a
--- /dev/null
+++ b/include/asm-generic/rwsem_types.h
@@ -0,0 +1,26 @@
+#ifndef _ASM_GENERIC_RWSEM_TYPES_H
+#define _ASM_GENERIC_RWSEM_TYPES_H
+
+#ifdef __KERNEL__
+
+/*
+ * the semaphore definition
+ *
+ * The bias values and the counter type limits the number of
+ * potential readers/writers to 32767 for 32 bits and 2147483647
+ * for 64 bits.
+ */
+#ifdef CONFIG_64BIT
+# define RWSEM_ACTIVE_MASK 0xffffffffL
+#else
+# define RWSEM_ACTIVE_MASK 0x0000ffffL
+#endif
+
+#define RWSEM_UNLOCKED_VALUE 0x00000000L
+#define RWSEM_ACTIVE_BIAS 0x00000001L
+#define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK-1)
+#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
+#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+
+#endif /* __KERNEL__ */
+#endif /* _ASM_GENERIC_RWSEM_TYPES_H */
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH-tip v4 07/10] locking/rwsem: Change RWSEM_WAITING_BIAS for better disambiguation
2016-08-18 21:11 [RFC PATCH-tip v4 00/10] locking/rwsem: Enable reader optimistic spinning Waiman Long
` (5 preceding siblings ...)
2016-08-18 21:11 ` [RFC PATCH-tip v4 06/10] locking/rwsem: Move common rwsem macros to asm-generic/rwsem_types.h Waiman Long
@ 2016-08-18 21:11 ` Waiman Long
2016-08-19 5:57 ` Wanpeng Li
2016-08-18 21:11 ` [RFC PATCH-tip v4 08/10] locking/rwsem: Enable spinning readers Waiman Long
` (2 subsequent siblings)
9 siblings, 1 reply; 33+ messages in thread
From: Waiman Long @ 2016-08-18 21:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch, Waiman Long
When the count value is in between 0 and RWSEM_WAITING_BIAS, there
are 2 possibilities. Either a writer is present and there is no
waiter or there are waiters and readers. There is no easy way to
know which is true unless the wait_lock is taken.
This patch changes the RWSEM_WAITING_BIAS from 0xffff (32-bit) or
0xffffffff (64-bit) to 0xc0000000 (32-bit) or 0xc000000000000000
(64-bit). By doing so, we will be able to determine if writers
are present by looking at the count value alone without taking the
wait_lock.
This patch has the effect of halving the maximum number of writers
that can attempt to take the write lock simultaneously. However,
even the reduced maximum of about 16k (32-bit) or 1G (64-bit) should
be more than enough for the foreseeable future.
With that change, the following identity is now no longer true:
RWSEM_ACTIVE_WRITE_BIAS = RWSEM_WAITING_BIAS + RWSEM_ACTIVE_READ_BIAS
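For illustration, writer presence can now be inferred from the count alone
along the following lines (a sketch; patch 8 in this series adds a helper
of essentially this form):

    /*
     * A writer is present when either
     *   count < RWSEM_WAITING_BIAS (writer(s) and waiters), or
     *   RWSEM_WAITING_BIAS - RWSEM_ACTIVE_WRITE_BIAS < count < 0
     *   (writer(s), possibly readers, but no waiter).
     */
    static inline bool count_has_writer(long count)
    {
            return (count < RWSEM_WAITING_BIAS) ||
                   ((count < 0) &&
                    (count > RWSEM_WAITING_BIAS - RWSEM_ACTIVE_WRITE_BIAS));
    }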
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
arch/alpha/include/asm/rwsem.h | 3 ++-
arch/ia64/include/asm/rwsem.h | 2 +-
arch/s390/include/asm/rwsem.h | 2 +-
arch/x86/include/asm/rwsem.h | 3 ++-
include/asm-generic/rwsem.h | 4 ++--
include/asm-generic/rwsem_types.h | 10 ++++++----
kernel/locking/rwsem-xadd.c | 32 ++++++++++++++++++++++++--------
7 files changed, 38 insertions(+), 18 deletions(-)
diff --git a/arch/alpha/include/asm/rwsem.h b/arch/alpha/include/asm/rwsem.h
index f99e39a..dc236a5 100644
--- a/arch/alpha/include/asm/rwsem.h
+++ b/arch/alpha/include/asm/rwsem.h
@@ -179,7 +179,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
"2: br 1b\n"
".previous"
:"=&r" (oldcount), "=m" (sem->count), "=&r" (temp)
- :"Ir" (-RWSEM_WAITING_BIAS), "m" (sem->count) : "memory");
+ :"Ir" (-RWSEM_ACTIVE_WRITE_BIAS + RWSEM_ACTIVE_READ_BIAS),
+ "m" (sem->count) : "memory");
#endif
if (unlikely(oldcount < 0))
rwsem_downgrade_wake(sem);
diff --git a/arch/ia64/include/asm/rwsem.h b/arch/ia64/include/asm/rwsem.h
index 21a9066..ecea341 100644
--- a/arch/ia64/include/asm/rwsem.h
+++ b/arch/ia64/include/asm/rwsem.h
@@ -141,7 +141,7 @@ __downgrade_write (struct rw_semaphore *sem)
do {
old = atomic_long_read(&sem->count);
- new = old - RWSEM_WAITING_BIAS;
+ new = old - RWSEM_ACTIVE_WRITE_BIAS + RWSEM_ACTIVE_READ_BIAS;
} while (atomic_long_cmpxchg_release(&sem->count, old, new) != old);
if (old < 0)
diff --git a/arch/s390/include/asm/rwsem.h b/arch/s390/include/asm/rwsem.h
index 13dedc8..e675a64 100644
--- a/arch/s390/include/asm/rwsem.h
+++ b/arch/s390/include/asm/rwsem.h
@@ -188,7 +188,7 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
{
signed long old, new, tmp;
- tmp = -RWSEM_WAITING_BIAS;
+ tmp = -RWSEM_ACTIVE_WRITE_BIAS + RWSEM_ACTIVE_READ_BIAS;
asm volatile(
" lg %0,%2\n"
"0: lgr %1,%0\n"
diff --git a/arch/x86/include/asm/rwsem.h b/arch/x86/include/asm/rwsem.h
index da84726..1edea1a 100644
--- a/arch/x86/include/asm/rwsem.h
+++ b/arch/x86/include/asm/rwsem.h
@@ -193,7 +193,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
"1:\n\t"
"# ending __downgrade_write\n"
: "+m" (sem->count)
- : "a" (sem), "er" (-RWSEM_WAITING_BIAS)
+ : "a" (sem), "er" (-RWSEM_ACTIVE_WRITE_BIAS +
+ RWSEM_ACTIVE_READ_BIAS)
: "memory", "cc");
}
diff --git a/include/asm-generic/rwsem.h b/include/asm-generic/rwsem.h
index 3cb8d98..962e75b 100644
--- a/include/asm-generic/rwsem.h
+++ b/include/asm-generic/rwsem.h
@@ -106,8 +106,8 @@ static inline void __downgrade_write(struct rw_semaphore *sem)
* read-locked region is ok to be re-ordered into the
* write side. As such, rely on RELEASE semantics.
*/
- tmp = atomic_long_add_return_release(-RWSEM_WAITING_BIAS,
- (atomic_long_t *)&sem->count);
+ tmp = atomic_long_add_return_release(-RWSEM_ACTIVE_WRITE_BIAS +
+ RWSEM_ACTIVE_READ_BIAS, (atomic_long_t *)&sem->count);
if (tmp < 0)
rwsem_downgrade_wake(sem);
}
diff --git a/include/asm-generic/rwsem_types.h b/include/asm-generic/rwsem_types.h
index 093ef6a..6d55d25 100644
--- a/include/asm-generic/rwsem_types.h
+++ b/include/asm-generic/rwsem_types.h
@@ -7,20 +7,22 @@
* the semaphore definition
*
* The bias values and the counter type limits the number of
- * potential readers/writers to 32767 for 32 bits and 2147483647
- * for 64 bits.
+ * potential writers to 16383 for 32 bits and 1073741823 for 64 bits.
+ * The combined readers and writers can go up to 65534 for 32-bits and
+ * 4294967294 for 64-bits.
*/
#ifdef CONFIG_64BIT
# define RWSEM_ACTIVE_MASK 0xffffffffL
+# define RWSEM_WAITING_BIAS (-(1L << 62))
#else
# define RWSEM_ACTIVE_MASK 0x0000ffffL
+# define RWSEM_WAITING_BIAS (-(1L << 30))
#endif
#define RWSEM_UNLOCKED_VALUE 0x00000000L
#define RWSEM_ACTIVE_BIAS 0x00000001L
-#define RWSEM_WAITING_BIAS (-RWSEM_ACTIVE_MASK-1)
#define RWSEM_ACTIVE_READ_BIAS RWSEM_ACTIVE_BIAS
-#define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)
+#define RWSEM_ACTIVE_WRITE_BIAS (-RWSEM_ACTIVE_MASK)
#endif /* __KERNEL__ */
#endif /* _ASM_GENERIC_RWSEM_TYPES_H */
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 6b5bd6e..8b118b3 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -29,28 +29,30 @@
* 0x00000000 rwsem is unlocked, and no one is waiting for the lock or
* attempting to read lock or write lock.
*
- * 0xffff000X (1) X readers active or attempting lock, with waiters for lock
+ * 0xc000000X (1) X readers active or attempting lock, with waiters for lock
* X = #active readers + # readers attempting lock
* (X*ACTIVE_BIAS + WAITING_BIAS)
- * (2) 1 writer attempting lock, no waiters for lock
+ *
+ * 0xffff000X (1) 1 writer attempting lock, no waiters for lock
* X-1 = #active readers + #readers attempting lock
* ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
- * (3) 1 writer active, no waiters for lock
+ * (2) 1 writer active, no waiters for lock
* X-1 = #active readers + #readers attempting lock
* ((X-1)*ACTIVE_BIAS + ACTIVE_WRITE_BIAS)
*
- * 0xffff0001 (1) 1 reader active or attempting lock, waiters for lock
+ * 0xc0000001 (1) 1 reader active or attempting lock, waiters for lock
* (WAITING_BIAS + ACTIVE_BIAS)
- * (2) 1 writer active or attempting lock, no waiters for lock
+ *
+ * 0xffff0001 (1) 1 writer active or attempting lock, no waiters for lock
* (ACTIVE_WRITE_BIAS)
*
- * 0xffff0000 (1) There are writers or readers queued but none active
+ * 0xc0000000 (1) There are writers or readers queued but none active
* or in the process of attempting lock.
* (WAITING_BIAS)
* Note: writer can attempt to steal lock for this count by adding
* ACTIVE_WRITE_BIAS in cmpxchg and checking the old count
*
- * 0xfffe0001 (1) 1 writer active, or attempting lock. Waiters on queue.
+ * 0xbfff0001 (1) 1 writer active, or attempting lock. Waiters on queue.
* (ACTIVE_WRITE_BIAS + WAITING_BIAS)
*
* Note: Readers attempt to lock by adding ACTIVE_BIAS in down_read and checking
@@ -62,9 +64,23 @@
* checking the count becomes ACTIVE_WRITE_BIAS for successful lock
* acquisition (i.e. nobody else has lock or attempts lock). If
* unsuccessful, in rwsem_down_write_failed, we'll check to see if there
- * are only waiters but none active (5th case above), and attempt to
+ * are only waiters but none active (7th case above), and attempt to
* steal the lock.
*
+ * We can infer the reader/writer/waiter state of the lock by looking
+ * at the count value:
+ * (1) count > 0
+ * Only readers are present.
+ * (2) WAITING_BIAS - ACTIVE_WRITE_BIAS < count < 0
+ * Have writers, maybe readers, but no waiter
+ * (3) WAITING_BIAS < count <= WAITING_BIAS - ACTIVE_WRITE_BIAS
+ * Have readers and waiters, but no writer
+ * (4) count < WAITING_BIAS
+ * Have writers and waiters, maybe readers
+ *
+ * IOW, writers are present when
+ * (1) count < WAITING_BIAS, or
+ * (2) WAITING_BIAS - ACTIVE_WRITE_BIAS < count < 0
*/
/*
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH-tip v4 08/10] locking/rwsem: Enable spinning readers
2016-08-18 21:11 [RFC PATCH-tip v4 00/10] locking/rwsem: Enable reader optimistic spinning Waiman Long
` (6 preceding siblings ...)
2016-08-18 21:11 ` [RFC PATCH-tip v4 07/10] locking/rwsem: Change RWSEM_WAITING_BIAS for better disambiguation Waiman Long
@ 2016-08-18 21:11 ` Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 09/10] locking/rwsem: Enable reactivation of reader spinning Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 10/10] locking/rwsem: Add a boot parameter to reader spinning threshold Waiman Long
9 siblings, 0 replies; 33+ messages in thread
From: Waiman Long @ 2016-08-18 21:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch, Waiman Long
This patch enables readers to optimistically spin when the
rspin_threshold is non-zero. That threshold value should only
be set when the lock owners of the rwsem are unlikely to go to
sleep. Otherwise enabling reader spinning may make the performance
worse in some cases.
On a 4-socket Haswell machine running a 4.7-rc1 tip-based kernel,
fio multithreaded randrw and randwrite tests were run on the same
file on an XFS partition on top of an NVDIMM with DAX. The
aggregated bandwidths before and after the reader optimistic
spinning patchset were as follows:
Test BW before patch BW after patch % change
---- --------------- -------------- --------
randrw 1352 MB/s 2164 MB/s +60%
randwrite 1710 MB/s 2550 MB/s +49%
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
kernel/locking/rwsem-xadd.c | 48 ++++++++++++++++++++++++++++++++++++------
1 files changed, 41 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 8b118b3..d850214 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -83,6 +83,12 @@
* (2) WAITING_BIAS - ACTIVE_WRITE_BIAS < count < 0
*/
+static inline bool count_has_writer(long count)
+{
+ return (count < RWSEM_WAITING_BIAS) || ((count < 0) &&
+ (count > RWSEM_WAITING_BIAS - RWSEM_ACTIVE_WRITE_BIAS));
+}
+
/*
* Initialize an rwsem:
*/
@@ -286,6 +292,25 @@ static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
}
}
+/*
+ * Try to acquire read lock before the reader is put on wait queue
+ */
+static inline bool rwsem_try_read_lock_unqueued(struct rw_semaphore *sem)
+{
+ long count = atomic_long_read(&sem->count);
+
+ if (count_has_writer(count))
+ return false;
+ count = atomic_long_add_return_acquire(RWSEM_ACTIVE_READ_BIAS,
+ &sem->count);
+ if (!count_has_writer(count))
+ return true;
+
+ /* Back out the change */
+ atomic_long_add(-RWSEM_ACTIVE_READ_BIAS, &sem->count);
+ return false;
+}
+
static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
{
struct task_struct *owner;
@@ -352,7 +377,8 @@ out:
return rwsem_owner_is_reader(READ_ONCE(sem->owner)) ? 0 : 1;
}
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem,
+ enum rwsem_waiter_type type)
{
bool taken = false;
int owner_state; /* Lock owner state */
@@ -380,10 +406,11 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
/*
* Try to acquire the lock
*/
- if (rwsem_try_write_lock_unqueued(sem)) {
- taken = true;
+ taken = (type == RWSEM_WAITING_FOR_WRITE)
+ ? rwsem_try_write_lock_unqueued(sem)
+ : rwsem_try_read_lock_unqueued(sem);
+ if (taken)
break;
- }
/*
* We only decrement the rspin_cnt when the lock is owned
@@ -418,7 +445,8 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
* Check the success or failure of writer spinning on reader so as
* to adjust the rspin_enabled count accordingly.
*/
- if (rwsem_owner_is_reader(sem->owner)) {
+ if ((type == RWSEM_WAITING_FOR_WRITE) &&
+ rwsem_owner_is_reader(sem->owner)) {
/*
* Update rspin_enabled for reader spinning.
*
@@ -458,7 +486,8 @@ static inline bool reader_spinning_enabled(struct rw_semaphore *sem)
return sem->rspin_enabled;
}
#else
-static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
+static bool rwsem_optimistic_spin(struct rw_semaphore *sem,
+ enum rwsem_waiter_type type)
{
return false;
}
@@ -492,6 +521,11 @@ struct rw_semaphore __sched *rwsem_down_read_failed(struct rw_semaphore *sem)
*/
atomic_long_add(-RWSEM_ACTIVE_READ_BIAS, &sem->count);
+ /* do optimistic spinning and steal lock if possible */
+ if (reader_spinning_enabled(sem) &&
+ rwsem_optimistic_spin(sem, RWSEM_WAITING_FOR_READ))
+ return sem;
+
waiter.task = tsk;
waiter.type = RWSEM_WAITING_FOR_READ;
@@ -548,7 +582,7 @@ __rwsem_down_write_failed_common(struct rw_semaphore *sem, int state)
count = atomic_long_sub_return(RWSEM_ACTIVE_WRITE_BIAS, &sem->count);
/* do optimistic spinning and steal lock if possible */
- if (rwsem_optimistic_spin(sem))
+ if (rwsem_optimistic_spin(sem, RWSEM_WAITING_FOR_WRITE))
return sem;
/*
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH-tip v4 09/10] locking/rwsem: Enable reactivation of reader spinning
2016-08-18 21:11 [RFC PATCH-tip v4 00/10] locking/rwsem: Enable reader optimistic spinning Waiman Long
` (7 preceding siblings ...)
2016-08-18 21:11 ` [RFC PATCH-tip v4 08/10] locking/rwsem: Enable spinning readers Waiman Long
@ 2016-08-18 21:11 ` Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 10/10] locking/rwsem: Add a boot parameter to reader spinning threshold Waiman Long
9 siblings, 0 replies; 33+ messages in thread
From: Waiman Long @ 2016-08-18 21:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch, Waiman Long
Reader optimistic spinning will be disabled once the rspin_enabled
count reaches 0. After that, it cannot be re-enabled. This may cause
an eligible rwsem to be locked out of reader spinning because of a series
of unfortunate events.
This patch looks at the regular writer-on-writer spinning history. If
there are sufficiently more successful spin attempts than failed ones,
it will try to reactivate reader spinning.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
include/linux/rwsem.h | 12 ++++++++----
kernel/locking/rwsem-xadd.c | 27 +++++++++++++++++++++++++--
2 files changed, 33 insertions(+), 6 deletions(-)
diff --git a/include/linux/rwsem.h b/include/linux/rwsem.h
index 8978f87..98284b4 100644
--- a/include/linux/rwsem.h
+++ b/include/linux/rwsem.h
@@ -32,7 +32,11 @@ struct rw_semaphore {
raw_spinlock_t wait_lock;
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
struct optimistic_spin_queue osq; /* spinner MCS lock */
- int rspin_enabled; /* protected by osq lock */
+ /*
+ * Reader optimistic spinning fields protected by osq lock
+ */
+ uint16_t rspin_enabled;
+ int16_t wspin_cnt;
/*
* Write owner. Used as a speculative check to see
@@ -74,10 +78,10 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
/*
* Each successful reader spin will increment the rspin_enabled by 1.
* Each unsuccessful spin, on the other hand, will decrement it by 2.
- * Reader spinning will be permanently disabled when it reaches 0.
+ * Reader spinning will be disabled when it reaches 0.
*/
#ifndef RWSEM_RSPIN_ENABLED_DEFAULT
-# define RWSEM_RSPIN_ENABLED_DEFAULT 40
+# define RWSEM_RSPIN_ENABLED_DEFAULT 30
#endif
#define RWSEM_RSPIN_ENABLED_MAX 1024
@@ -87,7 +91,7 @@ static inline int rwsem_is_locked(struct rw_semaphore *sem)
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
#define __RWSEM_OPT_INIT(lockname) , .osq = OSQ_LOCK_UNLOCKED, .owner = NULL, \
- .rspin_enabled = RWSEM_RSPIN_ENABLED_DEFAULT
+ .rspin_enabled = RWSEM_RSPIN_ENABLED_DEFAULT, .wspin_cnt = 0
#else
#define __RWSEM_OPT_INIT(lockname)
#endif
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index d850214..9978159 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -108,6 +108,7 @@ void __init_rwsem(struct rw_semaphore *sem, const char *name,
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
sem->owner = NULL;
sem->rspin_enabled = RWSEM_RSPIN_ENABLED_DEFAULT;
+ sem->wspin_cnt = 0;
osq_lock_init(&sem->osq);
#endif
}
@@ -458,10 +459,32 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem,
if (taken && (sem->rspin_enabled < RWSEM_RSPIN_ENABLED_MAX)) {
sem->rspin_enabled++;
} else if (!taken) {
- if (sem->rspin_enabled > 2)
+ if (sem->rspin_enabled > 2) {
sem->rspin_enabled -= 2;
- else
+ } else if (sem->rspin_enabled) {
sem->rspin_enabled = 0;
+ /*
+ * Reset wspin_cnt so that it won't get
+ * re-enabled too soon.
+ */
+ if (sem->wspin_cnt > -30)
+ sem->wspin_cnt = -30;
+ }
+ }
+ } else if (type == RWSEM_WAITING_FOR_WRITE) {
+ /*
+ * Every 10 net successful writer-on-writer spins (successes in
+ * excess of failures) will increment rspin_enabled to encourage
+ * more writer-on-reader spinning attempts.
+ */
+ if (taken) {
+ if ((++sem->wspin_cnt >= 10) &&
+ (sem->rspin_enabled < RWSEM_RSPIN_ENABLED_MAX)) {
+ sem->wspin_cnt = 0;
+ sem->rspin_enabled++;
+ }
+ } else if (sem->wspin_cnt > -100) {
+ sem->wspin_cnt--;
}
}
osq_unlock(&sem->osq);
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* [RFC PATCH-tip v4 10/10] locking/rwsem: Add a boot parameter to reader spinning threshold
2016-08-18 21:11 [RFC PATCH-tip v4 00/10] locking/rwsem: Enable reader optimistic spinning Waiman Long
` (8 preceding siblings ...)
2016-08-18 21:11 ` [RFC PATCH-tip v4 09/10] locking/rwsem: Enable reactivation of reader spinning Waiman Long
@ 2016-08-18 21:11 ` Waiman Long
2016-08-24 4:00 ` Davidlohr Bueso
9 siblings, 1 reply; 33+ messages in thread
From: Waiman Long @ 2016-08-18 21:11 UTC (permalink / raw)
To: Peter Zijlstra, Ingo Molnar
Cc: linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch, Waiman Long
The default reader spinning threshold is currently set to 4096. However,
the right reader spinning threshold may vary from one system to
another and among the different architectures. This patch adds a new
kernel boot parameter to modify the threshold value. This enables
better tailoring to the needs of different systems as well as for
testing purposes.
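For example, to raise the threshold on a system where reader critical
sections tend to run a bit longer, the parameter can be added to the
kernel command line (the value shown is arbitrary, for illustration only):

    rwsem_rspin_threshold=8192

The value is parsed by get_option() as an integer.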
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
Documentation/kernel-parameters.txt | 3 +++
kernel/locking/rwsem-xadd.c | 14 +++++++++++++-
2 files changed, 16 insertions(+), 1 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 46c030a..90892ac 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -3690,6 +3690,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
rw [KNL] Mount root device read-write on boot
+ rwsem_rspin_threshold= [KNL] Set rw semaphore reader spinning threshold
+
S [KNL] Run init in single mode
s390_iommu= [HW,S390]
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 9978159..7bb6255 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -272,6 +272,18 @@ static inline bool rwsem_try_write_lock(long count, struct rw_semaphore *sem)
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
/*
+ * Reader spinning threshold
+ */
+static int __read_mostly rspin_threshold = RWSEM_RSPIN_THRESHOLD;
+
+static int __init set_rspin_threshold(char *str)
+{
+ get_option(&str, &rspin_threshold);
+ return 0;
+}
+early_param("rwsem_rspin_threshold", set_rspin_threshold);
+
+/*
* Try to acquire write lock before the writer has been put on wait queue.
*/
static inline bool rwsem_try_write_lock_unqueued(struct rw_semaphore *sem)
@@ -394,7 +406,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem,
if (!osq_lock(&sem->osq))
goto done;
- rspin_cnt = sem->rspin_enabled ? RWSEM_RSPIN_THRESHOLD : 0;
+ rspin_cnt = sem->rspin_enabled ? rspin_threshold : 0;
/*
* Optimistically spin on the owner field and attempt to acquire the
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 07/10] locking/rwsem: Change RWSEM_WAITING_BIAS for better disambiguation
2016-08-18 21:11 ` [RFC PATCH-tip v4 07/10] locking/rwsem: Change RWSEM_WAITING_BIAS for better disambiguation Waiman Long
@ 2016-08-19 5:57 ` Wanpeng Li
2016-08-19 16:21 ` Waiman Long
0 siblings, 1 reply; 33+ messages in thread
From: Wanpeng Li @ 2016-08-19 5:57 UTC (permalink / raw)
To: Waiman Long
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel@vger.kernel.org,
the arch/x86 maintainers, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch
2016-08-19 5:11 GMT+08:00 Waiman Long <Waiman.Long@hpe.com>:
> When the count value is in between 0 and RWSEM_WAITING_BIAS, there
> are 2 possibilities.
> Either a writer is present and there is no waiter
count = 0xffff0001
>or there are waiters and readers. There is no easy way to
count = 0xffff000X
However, RWSEM_WAITING_BIAS is equal to 0xffff0000, so both these two
cases are beyond RWSEM_WAITING_BIAS, right?
Regards,
Wanpeng Li
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 07/10] locking/rwsem: Change RWSEM_WAITING_BIAS for better disambiguation
2016-08-19 5:57 ` Wanpeng Li
@ 2016-08-19 16:21 ` Waiman Long
2016-08-22 2:15 ` Wanpeng Li
0 siblings, 1 reply; 33+ messages in thread
From: Waiman Long @ 2016-08-19 16:21 UTC (permalink / raw)
To: Wanpeng Li
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel@vger.kernel.org,
the arch/x86 maintainers, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch
On 08/19/2016 01:57 AM, Wanpeng Li wrote:
> 2016-08-19 5:11 GMT+08:00 Waiman Long<Waiman.Long@hpe.com>:
>> When the count value is in between 0 and RWSEM_WAITING_BIAS, there
>> are 2 possibilities.
>> Either a writer is present and there is no waiter
> count = 0xffff0001
>
>> or there are waiters and readers. There is no easy way to
> count = 0xffff000X
>
> However, RWSEM_WAITING_BIAS is equal to 0xffff0000, so both these two
> cases are beyond RWSEM_WAITING_BIAS, right?
>
> Regards,
> Wanpeng Li
Perhaps I should make it clear that I am talking from a signed quantity
point of view (it is an atomic_long_t). So
RWSEM_WAITING_BIAS < RWSEM_ACTIVE_WRITE_BIAS < 0
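In concrete 64-bit numbers from this patch (added here for illustration):

    RWSEM_WAITING_BIAS      = -(1L << 62)  = -4611686018427387904
    RWSEM_ACTIVE_WRITE_BIAS = -0xffffffffL =          -4294967295

so both are negative, and RWSEM_WAITING_BIAS sits far below
RWSEM_ACTIVE_WRITE_BIAS.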
Hope this clarifies your question.
Cheers,
Longman
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 07/10] locking/rwsem: Change RWSEM_WAITING_BIAS for better disambiguation
2016-08-19 16:21 ` Waiman Long
@ 2016-08-22 2:15 ` Wanpeng Li
0 siblings, 0 replies; 33+ messages in thread
From: Wanpeng Li @ 2016-08-22 2:15 UTC (permalink / raw)
To: Waiman Long
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel@vger.kernel.org,
the arch/x86 maintainers, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Davidlohr Bueso, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch
2016-08-20 0:21 GMT+08:00 Waiman Long <waiman.long@hpe.com>:
> On 08/19/2016 01:57 AM, Wanpeng Li wrote:
>>
>> 2016-08-19 5:11 GMT+08:00 Waiman Long<Waiman.Long@hpe.com>:
>>>
>>> When the count value is in between 0 and RWSEM_WAITING_BIAS, there
>>> are 2 possibilities.
>>> Either a writer is present and there is no waiter
>>
>> count = 0xffff0001
>>
>>> or there are waiters and readers. There is no easy way to
>>
>> count = 0xffff000X
>>
>> However, RWSEM_WAITING_BIAS is equal to 0xffff0000, so both these two
>> cases are beyond RWSEM_WAITING_BIAS, right?
>>
>> Regards,
>> Wanpeng Li
>
>
> Perhaps I should make it clear that I am talking from a signed quantity
> point of view (it is an atomic_long_t). So
>
> RWSEM_WAITING_BIAS < RWSEM_ACTIVE_WRITE_BIAS < 0
>
> Hope this clarify your question.
Yeah, thank you. :)
Regards,
Wanpeng Li
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 10/10] locking/rwsem: Add a boot parameter to reader spinning threshold
2016-08-18 21:11 ` [RFC PATCH-tip v4 10/10] locking/rwsem: Add a boot parameter to reader spinning threshold Waiman Long
@ 2016-08-24 4:00 ` Davidlohr Bueso
2016-08-24 18:39 ` Waiman Long
0 siblings, 1 reply; 33+ messages in thread
From: Davidlohr Bueso @ 2016-08-24 4:00 UTC (permalink / raw)
To: Waiman Long
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
linux-ia64, linux-s390, linux-arch, linux-doc, Jason Low,
Dave Chinner, Jonathan Corbet, Scott J Norton, Douglas Hatch
On Thu, 18 Aug 2016, Waiman Long wrote:
>The default reader spinning threshold is currently set to 4096. However,
>the right reader spinning threshold may vary from one system to
>another and among the different architectures. This patch adds a new
>kernel boot parameter to modify the threshold value. This enables
>better tailoring to the needs of different systems as well as for
>testing purposes.
It also means that those systems could very easily be incorrectly tailored
(and the worst-case scenario, a reboot, is obviously a terrible way to get
rid of any issues). I very much disagree with exposing this sort of core
stuff; it should work well for everyone out of the box, not relying on users
to properly configure this.
Thanks,
Davidlohr
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 10/10] locking/rwsem: Add a boot parameter to reader spinning threshold
2016-08-24 4:00 ` Davidlohr Bueso
@ 2016-08-24 18:39 ` Waiman Long
0 siblings, 0 replies; 33+ messages in thread
From: Waiman Long @ 2016-08-24 18:39 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
linux-ia64, linux-s390, linux-arch, linux-doc, Jason Low,
Dave Chinner, Jonathan Corbet, Scott J Norton, Douglas Hatch
On 08/24/2016 12:00 AM, Davidlohr Bueso wrote:
> On Thu, 18 Aug 2016, Waiman Long wrote:
>
>> The default reader spinning threshold is currently set to 4096. However,
>> the right reader spinning threshold may vary from one system to
>> another and among the different architectures. This patch adds a new
>> kernel boot parameter to modify the threshold value. This enables
>> better tailoring to the needs of different systems as well as for
>> testing purposes.
>
> It also means that those systems could very easily be incorrectly
> tailored.
> (and the worst-case scenario, a reboot, is obviously a terrible way to get rid
> of any issues). I very much disagree with exposing this sort of core
> stuff,
> it should work well for everyone out of the box, not relying on users to
> properly configure this.
>
> Thanks,
> Davidlohr
I also have some concern about exposing this kernel parameter as it will
be hard to tune. That is why I put it at the end to gauge the opinion of
others. I will leave this out when I send out the next version.
Cheers,
Longman
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
2016-08-18 21:11 ` [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier Waiman Long
@ 2016-10-04 19:06 ` Davidlohr Bueso
2016-10-04 21:28 ` Jason Low
2016-10-05 12:19 ` Waiman Long
0 siblings, 2 replies; 33+ messages in thread
From: Davidlohr Bueso @ 2016-10-04 19:06 UTC (permalink / raw)
To: Waiman Long
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
linux-ia64, linux-s390, linux-arch, linux-doc, Jason Low,
Dave Chinner, Jonathan Corbet, Scott J Norton, Douglas Hatch
On Thu, 18 Aug 2016, Waiman Long wrote:
>The osq_lock() and osq_unlock() functions may not provide the necessary
>acquire and release barrier in some cases. This patch makes sure
>that the proper barriers are provided when osq_lock() is successful
>or when osq_unlock() is called.
But why do we need these guarantees given that osq is only used internally
for lock owner spinning situations? Leaking out of the critical region will
obviously be bad if using it as a full lock, but, as is, this can only hurt
performance of two of the most popular locks in the kernel -- although yes,
using smp_acquire__after_ctrl_dep is nicer for polling.
If you need tighter osq for rwsems, could it be refactored such that mutexes
do not take a hit?
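For reference, the polling pattern in question has roughly the shape below
(a schematic fragment, not the osq code itself). The spin loop only gives a
control dependency, and smp_acquire__after_ctrl_dep() upgrades it to a
load-ACQUIRE that pairs with the releaser's smp_store_release():

	/* release side: publish the critical section, then hand off */
	smp_store_release(&node->locked, 1);

	/* acquire side: poll, then upgrade the control dependency */
	while (!READ_ONCE(node->locked))
		cpu_relax_lowlatency();
	smp_acquire__after_ctrl_dep();	/* now a proper ACQUIRE */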
>
>Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
>---
> kernel/locking/osq_lock.c | 24 ++++++++++++++++++------
> 1 files changed, 18 insertions(+), 6 deletions(-)
>
>diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>index 05a3785..3da0b97 100644
>--- a/kernel/locking/osq_lock.c
>+++ b/kernel/locking/osq_lock.c
>@@ -124,6 +124,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>
> cpu_relax_lowlatency();
> }
>+ /*
>+ * Add an acquire memory barrier for pairing with the release barrier
>+ * in unlock.
>+ */
>+ smp_acquire__after_ctrl_dep();
> return true;
>
> unqueue:
>@@ -198,13 +203,20 @@ void osq_unlock(struct optimistic_spin_queue *lock)
> * Second most likely case.
> */
> node = this_cpu_ptr(&osq_node);
>- next = xchg(&node->next, NULL);
>- if (next) {
>- WRITE_ONCE(next->locked, 1);
>+ next = xchg_relaxed(&node->next, NULL);
>+ if (next)
>+ goto unlock;
>+
>+ next = osq_wait_next(lock, node, NULL);
>+ if (unlikely(!next)) {
>+ /*
>+ * In the unlikely event that the OSQ is empty, we need to
>+ * provide a proper release barrier.
>+ */
>+ smp_mb();
> return;
> }
>
>- next = osq_wait_next(lock, node, NULL);
>- if (next)
>- WRITE_ONCE(next->locked, 1);
>+unlock:
>+ smp_store_release(&next->locked, 1);
> }
As well as for the smp_acquire__after_ctrl_dep comment you have above, this also
obviously pairs with the osq_lock's smp_load_acquire while backing out (unqueueing,
step A). Given the above, for this case we might also just rely on READ_ONCE(node->locked);
if we get the conditional wrong and miss the node becoming locked, all we do is another
iteration, and while there is a cmpxchg() there, it is mitigated with the ccas thingy.
Thanks,
Davidlohr
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
2016-10-04 19:06 ` Davidlohr Bueso
@ 2016-10-04 21:28 ` Jason Low
2016-10-05 12:19 ` Waiman Long
1 sibling, 0 replies; 33+ messages in thread
From: Jason Low @ 2016-10-04 21:28 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Waiman Long, Peter Zijlstra, Ingo Molnar,
Linux Kernel Mailing List, x86, linux-alpha, linux-ia64,
linux-s390, linux-arch, linux-doc, Jason Low, Dave Chinner,
Jonathan Corbet, Scott J Norton, Douglas Hatch, jason.low2
On Tue, Oct 4, 2016 at 12:06 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> On Thu, 18 Aug 2016, Waiman Long wrote:
>
>> The osq_lock() and osq_unlock() functions may not provide the necessary
>> acquire and release barrier in some cases. This patch makes sure
>> that the proper barriers are provided when osq_lock() is successful
>> or when osq_unlock() is called.
>
>
> But why do we need these guarantees given that osq is only used internally
> for lock owner spinning situations? Leaking out of the critical region will
> obviously be bad if using it as a full lock, but, as is, this can only hurt
> performance of two of the most popular locks in the kernel -- although yes,
> using smp_acquire__after_ctrl_dep is nicer for polling.
>
> If you need tighter osq for rwsems, could it be refactored such that mutexes
> do not take a hit?
I agree with David, the OSQ is meant to be used internally as a
queuing mechanism rather than as something for protecting critical regions,
and so these guarantees of preventing critical section leaks don't
seem to be necessary for the OSQ. If that is the case, then it would
be better to avoid the performance hit to mutexes and rwsems.
Jason
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
2016-10-04 19:06 ` Davidlohr Bueso
2016-10-04 21:28 ` Jason Low
@ 2016-10-05 12:19 ` Waiman Long
2016-10-05 15:11 ` Waiman Long
1 sibling, 1 reply; 33+ messages in thread
From: Waiman Long @ 2016-10-05 12:19 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
linux-ia64, linux-s390, linux-arch, linux-doc, Jason Low,
Dave Chinner, Jonathan Corbet, Scott J Norton, Douglas Hatch
On 10/04/2016 03:06 PM, Davidlohr Bueso wrote:
> On Thu, 18 Aug 2016, Waiman Long wrote:
>
>> The osq_lock() and osq_unlock() functions may not provide the necessary
>> acquire and release barrier in some cases. This patch makes sure
>> that the proper barriers are provided when osq_lock() is successful
>> or when osq_unlock() is called.
>
> But why do we need these guarantees given that osq is only used
> internally
> for lock owner spinning situations? Leaking out of the critical region
> will
> obviously be bad if using it as a full lock, but, as is, this can only
> hurt
> performance of two of the most popular locks in the kernel -- although
> yes,
> using smp_acquire__after_ctrl_dep is nicer for polling.
First of all, it is not obvious from the name osq_lock() that it is not
an acquire barrier in some cases. We either need to clearly document it
or have a variant name that indicates that, e.g. osq_lock_relaxed.
Secondly, if we look at the use cases of osq_lock(), the additional
latency (for non-x86 archs) only matters if the master lock is
immediately available for acquisition after osq_lock() returns.
Otherwise, it will be hidden in the spinning loop for that master lock.
So yes, there may be a slight performance hit in some cases, but
certainly not always.
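Schematically, the spin path looks like this (a simplified sketch only;
spin_on_owner() and try_acquire() are placeholders for the real helpers,
not actual function names):

	if (!osq_lock(&lock->osq))		/* acquire barrier added on success */
		goto done;

	while (true) {
		if (!spin_on_owner(lock))	/* owner preempted or sleeping, bail out */
			break;
		if (try_acquire(lock)) {	/* immediate success: the only case where
						 * the extra barrier latency is visible */
			osq_unlock(&lock->osq);
			return true;
		}
		cpu_relax_lowlatency();		/* otherwise the latency is hidden here */
	}
	osq_unlock(&lock->osq);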
> If you need tighter osq for rwsems, could it be refactored such that
> mutexes
> do not take a hit?
>
Yes, we can certainly do that by splitting it into 2 variants, one with
an acquire barrier guarantee and one without.
>>
>> Suggested-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>> Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
>> ---
>> kernel/locking/osq_lock.c | 24 ++++++++++++++++++------
>> 1 files changed, 18 insertions(+), 6 deletions(-)
>>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 05a3785..3da0b97 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -124,6 +124,11 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>>
>> cpu_relax_lowlatency();
>> }
>> + /*
>> + * Add an acquire memory barrier for pairing with the release
>> barrier
>> + * in unlock.
>> + */
>> + smp_acquire__after_ctrl_dep();
>> return true;
>>
>> unqueue:
>> @@ -198,13 +203,20 @@ void osq_unlock(struct optimistic_spin_queue
>> *lock)
>> * Second most likely case.
>> */
>> node = this_cpu_ptr(&osq_node);
>> - next = xchg(&node->next, NULL);
>> - if (next) {
>> - WRITE_ONCE(next->locked, 1);
>> + next = xchg_relaxed(&node->next, NULL);
>> + if (next)
>> + goto unlock;
>> +
>> + next = osq_wait_next(lock, node, NULL);
>> + if (unlikely(!next)) {
>> + /*
>> + * In the unlikely event that the OSQ is empty, we need to
>> + * provide a proper release barrier.
>> + */
>> + smp_mb();
>> return;
>> }
>>
>> - next = osq_wait_next(lock, node, NULL);
>> - if (next)
>> - WRITE_ONCE(next->locked, 1);
>> +unlock:
>> + smp_store_release(&next->locked, 1);
>> }
>
> As well as for the smp_acquire__after_ctrl_dep comment you have above,
> this also
> obviously pairs with the osq_lock's smp_load_acquire while backing out
> (unqueueing,
> step A). Given the above, for this case we might also just rely on
> READ_ONCE(node->locked),
> if we get the conditional wrong and miss the node becoming locked, all
> we do is another
> iteration, and while there is a cmpxchg() there, it is mitigated with
> the ccas thingy.
Similar to osq_lock(), the current osq_unlock() does not have a release
barrier guarantee. I think splitting into 2 variants - osq_unlock,
osq_unlock_relaxed will help.
Cheers,
Longman
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
2016-10-05 12:19 ` Waiman Long
@ 2016-10-05 15:11 ` Waiman Long
2016-10-06 5:47 ` Davidlohr Bueso
0 siblings, 1 reply; 33+ messages in thread
From: Waiman Long @ 2016-10-05 15:11 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
linux-ia64, linux-s390, linux-arch, linux-doc, Jason Low,
Dave Chinner, Jonathan Corbet, Scott J Norton, Douglas Hatch
On 10/05/2016 08:19 AM, Waiman Long wrote:
> On 10/04/2016 03:06 PM, Davidlohr Bueso wrote:
>> On Thu, 18 Aug 2016, Waiman Long wrote:
>>
>>> The osq_lock() and osq_unlock() functions may not provide the necessary
>>> acquire and release barrier in some cases. This patch makes sure
>>> that the proper barriers are provided when osq_lock() is successful
>>> or when osq_unlock() is called.
>>
>> But why do we need these guarantees given that osq is only used
>> internally
>> for lock owner spinning situations? Leaking out of the critical
>> region will
>> obviously be bad if using it as a full lock, but, as is, this can
>> only hurt
>> performance of two of the most popular locks in the kernel --
>> although yes,
>> using smp_acquire__after_ctrl_dep is nicer for polling.
>
> First of all, it is not obvious from the name osq_lock() that it is
> not an acquire barrier in some cases. We either need to clearly
> document it or have a variant name that indicates that, e.g.
> osq_lock_relaxed.
>
> Secondly, if we look at the use cases of osq_lock(), the additional
> latency (for non-x86 archs) only matters if the master lock is
> immediately available for acquisition after osq_lock() returns.
> Otherwise, it will be hidden in the spinning loop for that master
> lock. So yes, there may be a slight performance hit in some cases, but
> certainly not always.
>
>> If you need tighter osq for rwsems, could it be refactored such that
>> mutexes
>> do not take a hit?
>>
>
> Yes, we can certainly do that by splitting it into 2 variants, one with
> an acquire barrier guarantee and one without.
>
How about the following patch? Does that address your concern?
Cheers,
Longman
----------------------------------[ Cut Here ]--------------------------------------
[PATCH] locking/osq: Provide 2 variants of lock/unlock functions
Despite the lock/unlock names, the osq_lock() and osq_unlock()
functions were not proper acquire and release barriers. To clarify
the situation, two different variants are now provided:
1) osq_lock/osq_unlock - they are proper acquire (if successful)
and release barriers respectively.
2) osq_lock_relaxed/osq_unlock_relaxed - they do not provide the
acquire/release barrier guarantee.
Both the mutex and rwsem optimistic spinning code is modified to
use the relaxed variants, which will be faster on some architectures as
the proper memory barrier semantics are not needed for queuing purposes.
Signed-off-by: Waiman Long <Waiman.Long@hpe.com>
---
include/linux/osq_lock.h | 2 +
kernel/locking/mutex.c | 6 +-
kernel/locking/osq_lock.c | 89 +++++++++++++++++++++++++++++++++++-------
kernel/locking/rwsem-xadd.c | 4 +-
4 files changed, 81 insertions(+), 20 deletions(-)
diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 703ea5c..72357d0 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -30,7 +30,9 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
}
extern bool osq_lock(struct optimistic_spin_queue *lock);
+extern bool osq_lock_relaxed(struct optimistic_spin_queue *lock);
extern void osq_unlock(struct optimistic_spin_queue *lock);
+extern void osq_unlock_relaxed(struct optimistic_spin_queue *lock);
static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
{
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..b1bf1e0 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -316,7 +316,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
* acquire the mutex all at once, the spinners need to take a
* MCS (queued) lock first before spinning on the owner field.
*/
- if (!osq_lock(&lock->osq))
+ if (!osq_lock_relaxed(&lock->osq))
goto done;
while (true) {
@@ -358,7 +358,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
}
mutex_set_owner(lock);
- osq_unlock(&lock->osq);
+ osq_unlock_relaxed(&lock->osq);
return true;
}
@@ -380,7 +380,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
cpu_relax_lowlatency();
}
- osq_unlock(&lock->osq);
+ osq_unlock_relaxed(&lock->osq);
done:
/*
* If we fell out of the spin path because of need_resched(),
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..1e6823a 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -12,6 +12,23 @@
*/
static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
+enum mbtype {
+ acquire,
+ release,
+ relaxed,
+};
+
+static __always_inline int
+_atomic_cmpxchg_(const enum mbtype barrier, atomic_t *v, int old, int new)
+{
+ if (barrier == acquire)
+ return atomic_cmpxchg_acquire(v, old, new);
+ else if (barrier == release)
+ return atomic_cmpxchg_release(v, old, new);
+ else
+ return atomic_cmpxchg_relaxed(v, old, new);
+}
+
/*
* We use the value 0 to represent "no CPU", thus the encoded value
* will be the CPU number incremented by 1.
@@ -35,7 +52,8 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
static inline struct optimistic_spin_node *
osq_wait_next(struct optimistic_spin_queue *lock,
struct optimistic_spin_node *node,
- struct optimistic_spin_node *prev)
+ struct optimistic_spin_node *prev,
+ const enum mbtype barrier)
{
struct optimistic_spin_node *next = NULL;
int curr = encode_cpu(smp_processor_id());
@@ -50,7 +68,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
for (;;) {
if (atomic_read(&lock->tail) == curr &&
- atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr) {
+ _atomic_cmpxchg_(barrier, &lock->tail, curr, old) == curr) {
/*
* We were the last queued, we moved @lock back. @prev
* will now observe @lock and will complete its
@@ -70,7 +88,13 @@ osq_wait_next(struct optimistic_spin_queue *lock,
* wait for a new @node->next from its Step-C.
*/
if (node->next) {
- next = xchg(&node->next, NULL);
+ if (barrier == acquire)
+ next = xchg_acquire(&node->next, NULL);
+ else if (barrier == release)
+ next = xchg_release(&node->next, NULL);
+ else
+ next = xchg_relaxed(&node->next, NULL);
+
if (next)
break;
}
@@ -81,7 +105,11 @@ osq_wait_next(struct optimistic_spin_queue *lock,
return next;
}
-bool osq_lock(struct optimistic_spin_queue *lock)
+/*
+ * We don't need to provide any memory barrier guarantee if the locking fails.
+ */
+static inline bool
+__osq_lock(struct optimistic_spin_queue *lock, const enum mbtype barrier)
{
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
@@ -124,6 +152,12 @@ bool osq_lock(struct optimistic_spin_queue *lock)
cpu_relax_lowlatency();
}
+ /*
+ * Add an acquire memory barrier, if necessary, for pairing with the
+ * release barrier in unlock.
+ */
+ if (barrier == acquire)
+ smp_acquire__after_ctrl_dep();
return true;
unqueue:
@@ -137,7 +171,7 @@ unqueue:
for (;;) {
if (prev->next == node &&
- cmpxchg(&prev->next, node, NULL) == node)
+ cmpxchg_relaxed(&prev->next, node, NULL) == node)
break;
/*
@@ -164,7 +198,7 @@ unqueue:
* back to @prev.
*/
- next = osq_wait_next(lock, node, prev);
+ next = osq_wait_next(lock, node, prev, relaxed);
if (!next)
return false;
@@ -182,7 +216,8 @@ unqueue:
return false;
}
-void osq_unlock(struct optimistic_spin_queue *lock)
+static inline void
+__osq_unlock(struct optimistic_spin_queue *lock, const enum mbtype barrier)
{
struct optimistic_spin_node *node, *next;
int curr = encode_cpu(smp_processor_id());
@@ -190,21 +225,45 @@ void osq_unlock(struct optimistic_spin_queue *lock)
/*
* Fast path for the uncontended case.
*/
- if (likely(atomic_cmpxchg_release(&lock->tail, curr,
- OSQ_UNLOCKED_VAL) == curr))
+ if (likely(_atomic_cmpxchg_(barrier, &lock->tail, curr,
+ OSQ_UNLOCKED_VAL) == curr))
return;
/*
* Second most likely case.
*/
node = this_cpu_ptr(&osq_node);
- next = xchg(&node->next, NULL);
- if (next) {
- WRITE_ONCE(next->locked, 1);
+ next = xchg_relaxed(&node->next, NULL);
+ if (next)
+ goto unlock;
+
+ next = osq_wait_next(lock, node, NULL, barrier);
+ if (unlikely(!next))
return;
- }
- next = osq_wait_next(lock, node, NULL);
- if (next)
+unlock:
+ if (barrier == release)
+ smp_store_release(&next->locked, 1);
+ else
WRITE_ONCE(next->locked, 1);
}
+
+bool osq_lock(struct optimistic_spin_queue *lock)
+{
+ return __osq_lock(lock, acquire);
+}
+
+bool osq_lock_relaxed(struct optimistic_spin_queue *lock)
+{
+ return __osq_lock(lock, relaxed);
+}
+
+void osq_unlock(struct optimistic_spin_queue *lock)
+{
+ __osq_unlock(lock, release);
+}
+
+void osq_unlock_relaxed(struct optimistic_spin_queue *lock)
+{
+ __osq_unlock(lock, relaxed);
+}
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..88e95b1 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -389,7 +389,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
if (!rwsem_can_spin_on_owner(sem))
goto done;
- if (!osq_lock(&sem->osq))
+ if (!osq_lock_relaxed(&sem->osq))
goto done;
/*
@@ -425,7 +425,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
*/
cpu_relax_lowlatency();
}
- osq_unlock(&sem->osq);
+ osq_unlock_relaxed(&sem->osq);
done:
preempt_enable();
return taken;
--
1.7.1
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
2016-10-05 15:11 ` Waiman Long
@ 2016-10-06 5:47 ` Davidlohr Bueso
2016-10-06 19:30 ` Waiman Long
2016-10-06 19:31 ` [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier Jason Low
0 siblings, 2 replies; 33+ messages in thread
From: Davidlohr Bueso @ 2016-10-06 5:47 UTC (permalink / raw)
To: Waiman Long
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
linux-ia64, linux-s390, linux-arch, linux-doc, Jason Low,
Dave Chinner, Jonathan Corbet, Scott J Norton, Douglas Hatch
On Wed, 05 Oct 2016, Waiman Long wrote:
>diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>index 05a3785..1e6823a 100644
>--- a/kernel/locking/osq_lock.c
>+++ b/kernel/locking/osq_lock.c
>@@ -12,6 +12,23 @@
> */
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node,
>osq_node);
>
>+enum mbtype {
>+ acquire,
>+ release,
>+ relaxed,
>+};
No, please.
>+
>+static __always_inline int
>+_atomic_cmpxchg_(const enum mbtype barrier, atomic_t *v, int old, int new)
>+{
>+ if (barrier == acquire)
>+ return atomic_cmpxchg_acquire(v, old, new);
>+ else if (barrier == release)
>+ return atomic_cmpxchg_release(v, old, new);
>+ else
>+ return atomic_cmpxchg_relaxed(v, old, new);
>+}
Things like the above are icky. How about something like below? I'm not
crazy about it, but there are other similar macros, ie lockref. We still
provide osq_lock/unlock to imply acquire/release and the new _relaxed
flavor, as I agree that should be the correct naming.
While I have not touched osq_wait_next(), the following are impacted:
- node->locked is now completely without ordering for _relaxed() (currently
it is under smp_load_acquire, which does not match, and the race is harmless
to begin with as we just iterate again). For the acquire flavor, it is always
formed with ctrl dep + smp_rmb().
- If osq_lock() fails we never guarantee any ordering.
What do you think?
Thanks,
Davidlohr
diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 703ea5c30a33..183ee51e6e54 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -29,9 +29,20 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
}
+/*
+ * Versions of osq_lock/unlock that do not imply or guarantee (load)-ACQUIRE
+ * (store)-RELEASE barrier semantics.
+ *
+ * Note that a failed call to either osq_lock() or osq_lock_relaxed() does
+ * not imply barriers... we are next to block.
+ */
+extern bool osq_lock_relaxed(struct optimistic_spin_queue *lock);
+extern void osq_unlock_relaxed(struct optimistic_spin_queue *lock);
+
extern bool osq_lock(struct optimistic_spin_queue *lock);
extern void osq_unlock(struct optimistic_spin_queue *lock);
+
static inline bool osq_is_locked(struct optimistic_spin_queue *lock)
{
return atomic_read(&lock->tail) != OSQ_UNLOCKED_VAL;
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90db3909..b1bf1e057565 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -316,7 +316,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
* acquire the mutex all at once, the spinners need to take a
* MCS (queued) lock first before spinning on the owner field.
*/
- if (!osq_lock(&lock->osq))
+ if (!osq_lock_relaxed(&lock->osq))
goto done;
while (true) {
@@ -358,7 +358,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
}
mutex_set_owner(lock);
- osq_unlock(&lock->osq);
+ osq_unlock_relaxed(&lock->osq);
return true;
}
@@ -380,7 +380,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
cpu_relax_lowlatency();
}
- osq_unlock(&lock->osq);
+ osq_unlock_relaxed(&lock->osq);
done:
/*
* If we fell out of the spin path because of need_resched(),
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a37857ab55..8c3d57698702 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -28,6 +28,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
return per_cpu_ptr(&osq_node, cpu_nr);
}
+static inline void set_node_locked_release(struct optimistic_spin_node *node)
+{
+ smp_store_release(&node->locked, 1);
+}
+
+static inline void set_node_locked_relaxed(struct optimistic_spin_node *node)
+{
+ WRITE_ONCE(node->locked, 1);
+
+}
+
/*
* Get a stable @node->next pointer, either for unlock() or unqueue() purposes.
* Can return NULL in case we were the last queued and we updated @lock instead.
@@ -81,130 +92,140 @@ osq_wait_next(struct optimistic_spin_queue *lock,
return next;
}
-bool osq_lock(struct optimistic_spin_queue *lock)
-{
- struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
- struct optimistic_spin_node *prev, *next;
- int curr = encode_cpu(smp_processor_id());
- int old;
-
- node->locked = 0;
- node->next = NULL;
- node->cpu = curr;
-
- /*
- * We need both ACQUIRE (pairs with corresponding RELEASE in
- * unlock() uncontended, or fastpath) and RELEASE (to publish
- * the node fields we just initialised) semantics when updating
- * the lock tail.
- */
- old = atomic_xchg(&lock->tail, curr);
- if (old == OSQ_UNLOCKED_VAL)
- return true;
-
- prev = decode_cpu(old);
- node->prev = prev;
- WRITE_ONCE(prev->next, node);
-
- /*
- * Normally @prev is untouchable after the above store; because at that
- * moment unlock can proceed and wipe the node element from stack.
- *
- * However, since our nodes are static per-cpu storage, we're
- * guaranteed their existence -- this allows us to apply
- * cmpxchg in an attempt to undo our queueing.
- */
-
- while (!READ_ONCE(node->locked)) {
- /*
- * If we need to reschedule bail... so we can block.
- */
- if (need_resched())
- goto unqueue;
-
- cpu_relax_lowlatency();
- }
- return true;
-
-unqueue:
- /*
- * Step - A -- stabilize @prev
- *
- * Undo our @prev->next assignment; this will make @prev's
- * unlock()/unqueue() wait for a next pointer since @lock points to us
- * (or later).
- */
-
- for (;;) {
- if (prev->next == node &&
- cmpxchg(&prev->next, node, NULL) == node)
- break;
-
- /*
- * We can only fail the cmpxchg() racing against an unlock(),
- * in which case we should observe @node->locked becomming
- * true.
- */
- if (smp_load_acquire(&node->locked))
- return true;
-
- cpu_relax_lowlatency();
-
- /*
- * Or we race against a concurrent unqueue()'s step-B, in which
- * case its step-C will write us a new @node->prev pointer.
- */
- prev = READ_ONCE(node->prev);
- }
-
- /*
- * Step - B -- stabilize @next
- *
- * Similar to unlock(), wait for @node->next or move @lock from @node
- * back to @prev.
- */
-
- next = osq_wait_next(lock, node, prev);
- if (!next)
- return false;
-
- /*
- * Step - C -- unlink
- *
- * @prev is stable because its still waiting for a new @prev->next
- * pointer, @next is stable because our @node->next pointer is NULL and
- * it will wait in Step-A.
- */
-
- WRITE_ONCE(next->prev, prev);
- WRITE_ONCE(prev->next, next);
-
- return false;
+#define OSQ_LOCK(EXT, FENCECB) \
+bool osq_lock##EXT(struct optimistic_spin_queue *lock) \
+{ \
+ struct optimistic_spin_node *node = this_cpu_ptr(&osq_node); \
+ struct optimistic_spin_node *prev, *next; \
+ int old, curr = encode_cpu(smp_processor_id()); \
+ \
+ node->locked = 0; \
+ node->next = NULL; \
+ node->cpu = curr; \
+ \
+ /* \
+ * We need both ACQUIRE (pairs with corresponding RELEASE in \
+ * unlock() uncontended, or fastpath) and RELEASE (to publish \
+ * the node fields we just initialised) semantics when updating \
+ * the lock tail. \
+ */ \
+ old = atomic_xchg(&lock->tail, curr); \
+ if (old == OSQ_UNLOCKED_VAL) \
+ return true; \
+ \
+ prev = decode_cpu(old); \
+ node->prev = prev; \
+ WRITE_ONCE(prev->next, node); \
+ \
+ /* \
+ * Normally @prev is untouchable after the above store; because \
+ * at that moment unlock can proceed and wipe the node element \
+ * from stack. \
+ * \
+ * However, since our nodes are static per-cpu storage, we're \
+ * guaranteed their existence -- this allows us to apply \
+ * cmpxchg in an attempt to undo our queueing. \
+ */ \
+ while (!READ_ONCE(node->locked)) { \
+ /* \
+ * If we need to reschedule bail... so we can block. \
+ */ \
+ if (need_resched()) \
+ goto unqueue; \
+ \
+ cpu_relax_lowlatency(); \
+ } \
+ FENCECB; \
+ return true; \
+ \
+unqueue: \
+ /* \
+ * Step - A -- stabilize @prev \
+ * \
+ * Undo our @prev->next assignment; this will make @prev's \
+ * unlock()/unqueue() wait for a next pointer since @lock \
+ * points to us (or later). \
+ */ \
+ for (;;) { \
+ /* \
+ * Failed calls to osq_lock() do not guarantee \
+ * barriers, thus always rely on RELAXED semantics. \
+ */ \
+ if (prev->next == node && \
+ cmpxchg_relaxed(&prev->next, node, NULL) == node) \
+ break; \
+ \
+ /* \
+ * We can only fail the cmpxchg() racing against an \
+ * unlock(), in which case we should observe \
+ * @node->locked becoming true. \
+ */ \
+ if (READ_ONCE(node->locked)) { \
+ FENCECB; \
+ return true; \
+ } \
+ \
+ cpu_relax_lowlatency(); \
+ \
+ /* \
+ * Or we race against a concurrent unqueue()'s step-B, \
+ * in which case its step-C will write us a new \
+ * @node->prev pointer. \
+ */ \
+ prev = READ_ONCE(node->prev); \
+ } \
+ \
+ /* \
+ * Step - B -- stabilize @next \
+ * \
+ * Similar to unlock(), wait for @node->next or move @lock \
+ * from @node back to @prev. \
+ */ \
+ \
+ next = osq_wait_next(lock, node, prev); \
+ if (!next) \
+ return false; \
+ \
+ /* \
+ * Step - C -- unlink \
+ * \
+ * @prev is stable because its still waiting for a new \
+ * @prev->next pointer, @next is stable because our \
+ * @node->next pointer is NULL and it will wait in Step-A. \
+ */ \
+ \
+ WRITE_ONCE(next->prev, prev); \
+ WRITE_ONCE(prev->next, next); \
+ \
+ return false; \
}
-void osq_unlock(struct optimistic_spin_queue *lock)
-{
- struct optimistic_spin_node *node, *next;
- int curr = encode_cpu(smp_processor_id());
-
- /*
- * Fast path for the uncontended case.
- */
- if (likely(atomic_cmpxchg_release(&lock->tail, curr,
- OSQ_UNLOCKED_VAL) == curr))
- return;
-
- /*
- * Second most likely case.
- */
- node = this_cpu_ptr(&osq_node);
- next = xchg(&node->next, NULL);
- if (next) {
- WRITE_ONCE(next->locked, 1);
- return;
- }
-
- next = osq_wait_next(lock, node, NULL);
- if (next)
- WRITE_ONCE(next->locked, 1);
+OSQ_LOCK(, smp_acquire__after_ctrl_dep())
+OSQ_LOCK(_relaxed, )
+
+#define OSQ_UNLOCK(EXT, FENCE) \
+void osq_unlock##EXT(struct optimistic_spin_queue *lock) \
+{ \
+ struct optimistic_spin_node *node, *next; \
+ int curr = encode_cpu(smp_processor_id()); \
+ \
+ /* Fast path for the uncontended case. */ \
+ if (likely(atomic_cmpxchg_##FENCE(&lock->tail, curr, \
+ OSQ_UNLOCKED_VAL) == curr)) \
+ return; \
+ \
+ /* Second most likely case. */ \
+ node = this_cpu_ptr(&osq_node); \
+ next = xchg(&node->next, NULL); \
+ if (next) \
+ goto done_setlocked; \
+ \
+ next = osq_wait_next(lock, node, NULL); \
+ if (!next) \
+ return; \
+done_setlocked: \
+ set_node_locked_##FENCE(next); \
}
+
+OSQ_UNLOCK(, release)
+OSQ_UNLOCK(_relaxed, relaxed)
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4bb2366..88e95b114392 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -389,7 +389,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
if (!rwsem_can_spin_on_owner(sem))
goto done;
- if (!osq_lock(&sem->osq))
+ if (!osq_lock_relaxed(&sem->osq))
goto done;
/*
@@ -425,7 +425,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
*/
cpu_relax_lowlatency();
}
- osq_unlock(&sem->osq);
+ osq_unlock_relaxed(&sem->osq);
done:
preempt_enable();
return taken;
^ permalink raw reply related [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP
2016-08-18 21:11 ` [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP Waiman Long
@ 2016-10-06 18:17 ` Davidlohr Bueso
2016-10-06 21:47 ` Dave Chinner
0 siblings, 1 reply; 33+ messages in thread
From: Davidlohr Bueso @ 2016-10-06 18:17 UTC (permalink / raw)
To: Waiman Long
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
linux-ia64, linux-s390, linux-arch, linux-doc, Jason Low,
Dave Chinner, Jonathan Corbet, Scott J Norton, Douglas Hatch
On Thu, 18 Aug 2016, Waiman Long wrote:
>Currently, when down_read() fails, the active read locking isn't undone
>until the rwsem_down_read_failed() function grabs the wait_lock. If the
>wait_lock is contended, it may take a while to get the lock. During
>that period, writer lock stealing will be disabled because of the
>active read lock.
>
>This patch will release the active read lock ASAP so that writer lock
>stealing can happen sooner. The only downside is when the reader is
>the first one in the wait queue as it has to issue another atomic
>operation to update the count.
>
>On a 4-socket Haswell machine running on a 4.7-rc1 tip-based kernel,
>the fio test with multithreaded randrw and randwrite tests on the
>same file on a XFS partition on top of a NVDIMM with DAX were run,
>the aggregated bandwidths before and after the patch were as follows:
>
> Test BW before patch BW after patch % change
> ---- --------------- -------------- --------
> randrw 1210 MB/s 1352 MB/s +12%
> randwrite 1622 MB/s 1710 MB/s +5.4%
Yeah, this is really a bad workload to make decisions on locking
heuristics imo - if I'm thinking of the same workload. Mainly because
concurrent buffered io to the same file isn't very realistic and you
end up pathologically pounding on i_rwsem (which until recently was
i_mutex, before Al's parallel lookup/readdir work). Obviously write
lock stealing wins in this case.
Thanks,
Davidlohr
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
2016-10-06 5:47 ` Davidlohr Bueso
@ 2016-10-06 19:30 ` Waiman Long
2016-10-10 5:39 ` [PATCH] locking/osq: Provide proper lock/unlock and relaxed flavors Davidlohr Bueso
2016-10-06 19:31 ` [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier Jason Low
1 sibling, 1 reply; 33+ messages in thread
From: Waiman Long @ 2016-10-06 19:30 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
linux-ia64, linux-s390, linux-arch, linux-doc, Jason Low,
Dave Chinner, Jonathan Corbet, Scott J Norton, Douglas Hatch
On 10/06/2016 01:47 AM, Davidlohr Bueso wrote:
> On Wed, 05 Oct 2016, Waiman Long wrote:
>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 05a3785..1e6823a 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -12,6 +12,23 @@
>> */
>> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node,
>> osq_node);
>>
>> +enum mbtype {
>> + acquire,
>> + release,
>> + relaxed,
>> +};
>
> No, please.
>
>> +
>> +static __always_inline int
>> +_atomic_cmpxchg_(const enum mbtype barrier, atomic_t *v, int old,
>> int new)
>> +{
>> + if (barrier == acquire)
>> + return atomic_cmpxchg_acquire(v, old, new);
>> + else if (barrier == release)
>> + return atomic_cmpxchg_release(v, old, new);
>> + else
>> + return atomic_cmpxchg_relaxed(v, old, new);
>> +}
>
> Things like the above are icky. How about something like below? I'm not
> crazy about it, but there are other similar macros, ie lockref. We still
> provide the osq_lock/unlock to imply acquire/release and the new _relaxed
> flavor, as I agree that should be the correct naming
>
> While I have not touched osq_wait_next(), the following are impacted:
>
> - node->locked is now completely without ordering for _relaxed()
> (currently
> its under smp_load_acquire, which does not match and the race is harmless
> to begin with as we just iterate again. For the acquire flavor, it is
> always
> formed with ctr dep + smp_rmb().
>
> - If osq_lock() fails we never guarantee any ordering.
>
> What do you think?
>
> Thanks,
> Davidlohr
Yes, I am OK with your change. However, I need some additional changes
in osq_wait_next() as well. Either it is changed to use the release
variants of atomic_cmpxchg and xchg, or it uses macros like what you did
with osq_lock and osq_unlock. The release variant is needed in
osq_unlock(). As osq_wait_next() is only invoked in the failure path of
osq_lock(), the barrier type doesn't really matter there.
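To illustrate, the two call sites have roughly the following shape (a
simplified sketch, not the literal code; set_node_locked_release() is the
helper from your macro version):

	/* osq_unlock() slow path: handing the lock off to @next wants RELEASE */
	next = osq_wait_next(lock, node, NULL);
	if (next)
		set_node_locked_release(next);

	/* osq_lock() unqueue (failure) path: no ordering required */
	next = osq_wait_next(lock, node, prev);
	if (!next)
		return false;	/* back out and go block */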
Cheers,
Longman
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier
2016-10-06 5:47 ` Davidlohr Bueso
2016-10-06 19:30 ` Waiman Long
@ 2016-10-06 19:31 ` Jason Low
1 sibling, 0 replies; 33+ messages in thread
From: Jason Low @ 2016-10-06 19:31 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Waiman Long, Peter Zijlstra, Ingo Molnar,
Linux Kernel Mailing List, x86, linux-alpha, linux-ia64,
linux-s390, linux-arch, linux-doc, Dave Chinner, Jonathan Corbet,
Scott J Norton, Douglas Hatch, jason.low2
On Wed, Oct 5, 2016 at 10:47 PM, Davidlohr Bueso <dave@stgolabs.net> wrote:
> On Wed, 05 Oct 2016, Waiman Long wrote:
>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index 05a3785..1e6823a 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -12,6 +12,23 @@
>> */
>> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node,
>> osq_node);
>>
>> +enum mbtype {
>> + acquire,
>> + release,
>> + relaxed,
>> +};
>
>
> No, please.
>
>> +
>> +static __always_inline int
>> +_atomic_cmpxchg_(const enum mbtype barrier, atomic_t *v, int old, int
>> new)
>> +{
>> + if (barrier == acquire)
>> + return atomic_cmpxchg_acquire(v, old, new);
>> + else if (barrier == release)
>> + return atomic_cmpxchg_release(v, old, new);
>> + else
>> + return atomic_cmpxchg_relaxed(v, old, new);
>> +}
>
>
> Things like the above are icky. How about something like below? I'm not
> crazy about it, but there are other similar macros, ie lockref. We still
> provide the osq_lock/unlock to imply acquire/release and the new _relaxed
> flavor, as I agree that should be the correct naming
>
> While I have not touched osq_wait_next(), the following are impacted:
>
> - node->locked is now completely without ordering for _relaxed() (currently
> its under smp_load_acquire, which does not match and the race is harmless
> to begin with as we just iterate again. For the acquire flavor, it is always
> formed with ctr dep + smp_rmb().
>
> - If osq_lock() fails we never guarantee any ordering.
>
> What do you think?
I think that this method looks cleaner than the version with the
conditionals and enum.
My first preference though would be to document that the current code
doesn't provide the acquire semantics. The thinking is that if OSQ is
just used internally as a queuing mechanism and isn't used as a
traditional lock guarding critical sections, then it may just be
misleading and just add more complexity.
If we do want separate acquire and relaxed variants, then I think
David's version using macros is a bit more in line with other locking
code.
Jason
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP
2016-10-06 18:17 ` Davidlohr Bueso
@ 2016-10-06 21:47 ` Dave Chinner
2016-10-06 22:51 ` Davidlohr Bueso
` (2 more replies)
0 siblings, 3 replies; 33+ messages in thread
From: Dave Chinner @ 2016-10-06 21:47 UTC (permalink / raw)
To: Davidlohr Bueso
Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, linux-kernel, x86,
linux-alpha, linux-ia64, linux-s390, linux-arch, linux-doc,
Jason Low, Jonathan Corbet, Scott J Norton, Douglas Hatch
On Thu, Oct 06, 2016 at 11:17:18AM -0700, Davidlohr Bueso wrote:
> On Thu, 18 Aug 2016, Waiman Long wrote:
>
> >Currently, when down_read() fails, the active read locking isn't undone
> >until the rwsem_down_read_failed() function grabs the wait_lock. If the
> >wait_lock is contended, it may take a while to get the lock. During
> >that period, writer lock stealing will be disabled because of the
> >active read lock.
> >
> >This patch will release the active read lock ASAP so that writer lock
> >stealing can happen sooner. The only downside is when the reader is
> >the first one in the wait queue as it has to issue another atomic
> >operation to update the count.
> >
> >On a 4-socket Haswell machine running on a 4.7-rc1 tip-based kernel,
> >the fio test with multithreaded randrw and randwrite tests on the
> >same file on a XFS partition on top of a NVDIMM with DAX were run,
> >the aggregated bandwidths before and after the patch were as follows:
> >
> > Test BW before patch BW after patch % change
> > ---- --------------- -------------- --------
> > randrw 1210 MB/s 1352 MB/s +12%
> > randwrite 1622 MB/s 1710 MB/s +5.4%
>
> Yeah, this is really a bad workload to make decisions on locking
> heuristics imo - if I'm thinking of the same workload. Mainly because
> concurrent buffered io to the same file isn't very realistic and you
> end up pathologically pounding on i_rwsem (which used to be until
> recently i_mutex until Al's parallel lookup/readdir). Obviously write
> lock stealing wins in this case.
Except that it's DAX, and in 4.7-rc1 that used shared locking at the
XFS level and never took exclusive locks.
*However*, the DAX IO path locking in XFS has changed in 4.9-rc1 to
match the buffered IO single writer POSIX semantics - the test is a
bad test based on the fact it exercised a path that is under heavy
development and so can't be used as a regression test across
multiple kernels.
If you want to stress concurrent access to a single file, please
use direct IO, not DAX or buffered IO.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP
2016-10-06 21:47 ` Dave Chinner
@ 2016-10-06 22:51 ` Davidlohr Bueso
2016-10-07 21:45 ` Waiman Long
2016-10-09 15:17 ` Christoph Hellwig
2 siblings, 0 replies; 33+ messages in thread
From: Davidlohr Bueso @ 2016-10-06 22:51 UTC (permalink / raw)
To: Dave Chinner
Cc: Waiman Long, Peter Zijlstra, Ingo Molnar, linux-kernel, x86,
linux-alpha, linux-ia64, linux-s390, linux-arch, linux-doc,
Jason Low, Jonathan Corbet, Scott J Norton, Douglas Hatch
On Fri, 07 Oct 2016, Dave Chinner wrote:
>Except that it's DAX
Duh, of course; silly me.
Thanks,
Davidlohr
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP
2016-10-06 21:47 ` Dave Chinner
2016-10-06 22:51 ` Davidlohr Bueso
@ 2016-10-07 21:45 ` Waiman Long
2016-10-09 15:17 ` Christoph Hellwig
2 siblings, 0 replies; 33+ messages in thread
From: Waiman Long @ 2016-10-07 21:45 UTC (permalink / raw)
To: Dave Chinner
Cc: Davidlohr Bueso, Peter Zijlstra, Ingo Molnar, linux-kernel, x86,
linux-alpha, linux-ia64, linux-s390, linux-arch, linux-doc,
Jason Low, Jonathan Corbet, Scott J Norton, Douglas Hatch
On 10/06/2016 05:47 PM, Dave Chinner wrote:
> On Thu, Oct 06, 2016 at 11:17:18AM -0700, Davidlohr Bueso wrote:
>> On Thu, 18 Aug 2016, Waiman Long wrote:
>>
>>> Currently, when down_read() fails, the active read locking isn't undone
>>> until the rwsem_down_read_failed() function grabs the wait_lock. If the
>>> wait_lock is contended, it may takes a while to get the lock. During
>>> that period, writer lock stealing will be disabled because of the
>>> active read lock.
>>>
>>> This patch will release the active read lock ASAP so that writer lock
>>> stealing can happen sooner. The only downside is when the reader is
>>> the first one in the wait queue as it has to issue another atomic
>>> operation to update the count.
>>>
>>> On a 4-socket Haswell machine running on a 4.7-rc1 tip-based kernel,
>>> the fio test with multithreaded randrw and randwrite tests on the
>>> same file on a XFS partition on top of a NVDIMM with DAX were run,
>>> the aggregated bandwidths before and after the patch were as follows:
>>>
>>> Test BW before patch BW after patch % change
>>> ---- --------------- -------------- --------
>>> randrw 1210 MB/s 1352 MB/s +12%
>>> randwrite 1622 MB/s 1710 MB/s +5.4%
>> Yeah, this is really a bad workload to make decisions on locking
>> heuristics imo - if I'm thinking of the same workload. Mainly because
>> concurrent buffered io to the same file isn't very realistic and you
>> end up pathologically pounding on i_rwsem (which used to be until
>> recently i_mutex until Al's parallel lookup/readdir). Obviously write
>> lock stealing wins in this case.
> Except that it's DAX, and in 4.7-rc1 that used shared locking at the
> XFS level and never took exclusive locks.
>
> *However*, the DAX IO path locking in XFS has changed in 4.9-rc1 to
> match the buffered IO single writer POSIX semantics - the test is a
> bad test based on the fact it exercised a path that is under heavy
> development and so can't be used as a regression test across
> multiple kernels.
>
> If you want to stress concurrent access to a single file, please
> use direct IO, not DAX or buffered IO.
Thanks for the update. I will change the test when I update this patch.
Cheers,
Longman
^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP
2016-10-06 21:47 ` Dave Chinner
2016-10-06 22:51 ` Davidlohr Bueso
2016-10-07 21:45 ` Waiman Long
@ 2016-10-09 15:17 ` Christoph Hellwig
2016-10-10 6:07 ` Dave Chinner
2 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2016-10-09 15:17 UTC (permalink / raw)
To: Dave Chinner
Cc: Davidlohr Bueso, Waiman Long, Peter Zijlstra, Ingo Molnar,
linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Jason Low, Jonathan Corbet, Scott J Norton,
Douglas Hatch
On Fri, Oct 07, 2016 at 08:47:51AM +1100, Dave Chinner wrote:
> Except that it's DAX, and in 4.7-rc1 that used shared locking at the
> XFS level and never took exclusive locks.
>
> *However*, the DAX IO path locking in XFS has changed in 4.9-rc1 to
> match the buffered IO single writer POSIX semantics - the test is a
> bad test based on the fact it exercised a path that is under heavy
> development and so can't be used as a regression test across
> multiple kernels.
That being said - I wonder if we should allow the shared lock on DAX
files IFF the user is specifying O_DIRECT in the open mode..
^ permalink raw reply [flat|nested] 33+ messages in thread
* [PATCH] locking/osq: Provide proper lock/unlock and relaxed flavors
2016-10-06 19:30 ` Waiman Long
@ 2016-10-10 5:39 ` Davidlohr Bueso
0 siblings, 0 replies; 33+ messages in thread
From: Davidlohr Bueso @ 2016-10-10 5:39 UTC (permalink / raw)
To: Waiman Long
Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, x86, linux-alpha,
linux-ia64, linux-s390, linux-arch, linux-doc, Jason Low,
Dave Chinner, Jonathan Corbet, Scott J Norton, Douglas Hatch
Because osq has only been used for mutex/rwsem spinning logic,
we have gotten away with being rather flexible in any of the
traditional lock/unlock ACQUIRE/RELEASE minimal guarantees.
However, if it were to be used as a _real_ lock, it would be in
trouble. To this end, this patch provides two alternatives: the
osq_lock/unlock() calls have the required semantics, and the
_relaxed() calls provide no ordering guarantees at all.
- node->locked is now completely without ordering for _relaxed()
(currently it is under smp_load_acquire, which does not match, and
the race is harmless to begin with as we just iterate again). For
the ACQUIRE flavor, it is always formed with ctrl dep + smp_rmb().
- In order to avoid more code duplication via macros, the common
osq_wait_next() call is completely unordered, but the caller
can provide the necessary barriers, if required - ie the case for
osq_unlock(): similar to the node->locked case, this also relies
on ctrl dep + smp_wmb() to form RELEASE.
- If osq_lock() fails we never guarantee any ordering (obviously
same goes for _relaxed).
Both mutexes and rwsems have been updated to continue using the
relaxed versions, but this will obviously change for the latter.
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
---
XXX: This obviously needs a lot of testing.
include/asm-generic/barrier.h | 9 ++
include/linux/osq_lock.h | 10 ++
kernel/locking/mutex.c | 6 +-
kernel/locking/osq_lock.c | 279 +++++++++++++++++++++++-------------------
kernel/locking/rwsem-xadd.c | 4 +-
5 files changed, 177 insertions(+), 131 deletions(-)
diff --git a/include/asm-generic/barrier.h b/include/asm-generic/barrier.h
index fe297b599b0a..0036b08151c3 100644
--- a/include/asm-generic/barrier.h
+++ b/include/asm-generic/barrier.h
@@ -221,6 +221,15 @@ do { \
#endif
/**
+ * smp_release__after_ctrl_dep() - Provide RELEASE ordering after a control dependency
+ *
+ * A control dependency provides a LOAD->STORE order, the additional WMB
+ * provides STORE->STORE order, together they provide {LOAD,STORE}->STORE order,
+ * aka. (store)-RELEASE.
+ */
+#define smp_release__after_ctrl_dep() smp_wmb()
+
+/**
* smp_cond_load_acquire() - (Spin) wait for cond with ACQUIRE ordering
* @ptr: pointer to the variable to wait on
* @cond: boolean expression to wait for
diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 703ea5c30a33..a63ffa95aa70 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -29,6 +29,16 @@ static inline void osq_lock_init(struct optimistic_spin_queue *lock)
atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
}
+/*
+ * Versions of osq_lock/unlock that do not imply or guarantee (load)-ACQUIRE
+ * (store)-RELEASE barrier semantics.
+ *
+ * Note that a failed call to either osq_lock() or osq_lock_relaxed() does
+ * not imply barriers... we are next to block.
+ */
+extern bool osq_lock_relaxed(struct optimistic_spin_queue *lock);
+extern void osq_unlock_relaxed(struct optimistic_spin_queue *lock);
+
extern bool osq_lock(struct optimistic_spin_queue *lock);
extern void osq_unlock(struct optimistic_spin_queue *lock);
diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90db3909..b1bf1e057565 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -316,7 +316,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
* acquire the mutex all at once, the spinners need to take a
* MCS (queued) lock first before spinning on the owner field.
*/
- if (!osq_lock(&lock->osq))
+ if (!osq_lock_relaxed(&lock->osq))
goto done;
while (true) {
@@ -358,7 +358,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
}
mutex_set_owner(lock);
- osq_unlock(&lock->osq);
+ osq_unlock_relaxed(&lock->osq);
return true;
}
@@ -380,7 +380,7 @@ static bool mutex_optimistic_spin(struct mutex *lock,
cpu_relax_lowlatency();
}
- osq_unlock(&lock->osq);
+ osq_unlock_relaxed(&lock->osq);
done:
/*
* If we fell out of the spin path because of need_resched(),
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a37857ab55..d3d1042a509c 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -28,6 +28,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
return per_cpu_ptr(&osq_node, cpu_nr);
}
+static inline void set_node_locked_release(struct optimistic_spin_node *node)
+{
+ smp_store_release(&node->locked, 1);
+}
+
+static inline void set_node_locked_relaxed(struct optimistic_spin_node *node)
+{
+ WRITE_ONCE(node->locked, 1);
+
+}
+
/*
* Get a stable @node->next pointer, either for unlock() or unqueue() purposes.
* Can return NULL in case we were the last queued and we updated @lock instead.
@@ -50,7 +61,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
for (;;) {
if (atomic_read(&lock->tail) == curr &&
- atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr) {
+ atomic_cmpxchg_relaxed(&lock->tail, curr, old) == curr) {
/*
* We were the last queued, we moved @lock back. @prev
* will now observe @lock and will complete its
@@ -70,7 +81,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
* wait for a new @node->next from its Step-C.
*/
if (node->next) {
- next = xchg(&node->next, NULL);
+ next = xchg_relaxed(&node->next, NULL);
if (next)
break;
}
@@ -81,130 +92,146 @@ osq_wait_next(struct optimistic_spin_queue *lock,
return next;
}
-bool osq_lock(struct optimistic_spin_queue *lock)
-{
- struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
- struct optimistic_spin_node *prev, *next;
- int curr = encode_cpu(smp_processor_id());
- int old;
-
- node->locked = 0;
- node->next = NULL;
- node->cpu = curr;
-
- /*
- * We need both ACQUIRE (pairs with corresponding RELEASE in
- * unlock() uncontended, or fastpath) and RELEASE (to publish
- * the node fields we just initialised) semantics when updating
- * the lock tail.
- */
- old = atomic_xchg(&lock->tail, curr);
- if (old == OSQ_UNLOCKED_VAL)
- return true;
-
- prev = decode_cpu(old);
- node->prev = prev;
- WRITE_ONCE(prev->next, node);
-
- /*
- * Normally @prev is untouchable after the above store; because at that
- * moment unlock can proceed and wipe the node element from stack.
- *
- * However, since our nodes are static per-cpu storage, we're
- * guaranteed their existence -- this allows us to apply
- * cmpxchg in an attempt to undo our queueing.
- */
-
- while (!READ_ONCE(node->locked)) {
- /*
- * If we need to reschedule bail... so we can block.
- */
- if (need_resched())
- goto unqueue;
-
- cpu_relax_lowlatency();
- }
- return true;
-
-unqueue:
- /*
- * Step - A -- stabilize @prev
- *
- * Undo our @prev->next assignment; this will make @prev's
- * unlock()/unqueue() wait for a next pointer since @lock points to us
- * (or later).
- */
-
- for (;;) {
- if (prev->next == node &&
- cmpxchg(&prev->next, node, NULL) == node)
- break;
-
- /*
- * We can only fail the cmpxchg() racing against an unlock(),
- * in which case we should observe @node->locked becomming
- * true.
- */
- if (smp_load_acquire(&node->locked))
- return true;
-
- cpu_relax_lowlatency();
-
- /*
- * Or we race against a concurrent unqueue()'s step-B, in which
- * case its step-C will write us a new @node->prev pointer.
- */
- prev = READ_ONCE(node->prev);
- }
-
- /*
- * Step - B -- stabilize @next
- *
- * Similar to unlock(), wait for @node->next or move @lock from @node
- * back to @prev.
- */
-
- next = osq_wait_next(lock, node, prev);
- if (!next)
- return false;
-
- /*
- * Step - C -- unlink
- *
- * @prev is stable because its still waiting for a new @prev->next
- * pointer, @next is stable because our @node->next pointer is NULL and
- * it will wait in Step-A.
- */
-
- WRITE_ONCE(next->prev, prev);
- WRITE_ONCE(prev->next, next);
-
- return false;
+#define OSQ_LOCK(EXT, FENCECB) \
+bool osq_lock##EXT(struct optimistic_spin_queue *lock) \
+{ \
+ struct optimistic_spin_node *node = this_cpu_ptr(&osq_node); \
+ struct optimistic_spin_node *prev, *next; \
+ int old, curr = encode_cpu(smp_processor_id()); \
+ \
+ node->locked = 0; \
+ node->next = NULL; \
+ node->cpu = curr; \
+ \
+ /* \
+ * At the very least we need RELEASE semantics to initialize \
+ * the node fields _before_ publishing it to the lock tail. \
+ */ \
+ old = atomic_xchg_release(&lock->tail, curr); \
+ if (old == OSQ_UNLOCKED_VAL) { \
+ FENCECB; \
+ return true; \
+ } \
+ \
+ prev = decode_cpu(old); \
+ node->prev = prev; \
+ WRITE_ONCE(prev->next, node); \
+ \
+ /* \
+ * Normally @prev is untouchable after the above store; because \
+ * at that moment unlock can proceed and wipe the node element \
+ * from stack. \
+ * \
+ * However, since our nodes are static per-cpu storage, we're \
+ * guaranteed their existence -- this allows us to apply \
+ * cmpxchg in an attempt to undo our queueing. \
+ */ \
+ while (!READ_ONCE(node->locked)) { \
+ /* \
+ * If we need to reschedule bail... so we can block. \
+ */ \
+ if (need_resched()) \
+ goto unqueue; \
+ \
+ cpu_relax_lowlatency(); \
+ } \
+ FENCECB; \
+ return true; \
+ \
+unqueue: \
+ /* \
+ * Step - A -- stabilize @prev \
+ * \
+ * Undo our @prev->next assignment; this will make @prev's \
+ * unlock()/unqueue() wait for a next pointer since @lock \
+ * points to us (or later). \
+ */ \
+ for (;;) { \
+ /* \
+ * Failed calls to osq_lock() do not guarantee any \
+ * ordering, thus always rely on RELAXED semantics. \
+ * This also applies below, in Step - B. \
+ */ \
+ if (prev->next == node && \
+ cmpxchg_relaxed(&prev->next, node, NULL) == node) \
+ break; \
+ \
+ /* \
+ * We can only fail the cmpxchg() racing against an \
+ * unlock(), in which case we should observe \
+ * @node->locked becoming true. \
+ */ \
+ if (READ_ONCE(node->locked)) { \
+ FENCECB; \
+ return true; \
+ } \
+ \
+ cpu_relax_lowlatency(); \
+ \
+ /* \
+ * Or we race against a concurrent unqueue()'s step-B, \
+ * in which case its step-C will write us a new \
+ * @node->prev pointer. \
+ */ \
+ prev = READ_ONCE(node->prev); \
+ } \
+ \
+ /* \
+ * Step - B -- stabilize @next \
+ * \
+ * Similar to unlock(), wait for @node->next or move @lock \
+ * from @node back to @prev. \
+ */ \
+ next = osq_wait_next(lock, node, prev); \
+ if (!next) \
+ return false; \
+ \
+ /* \
+ * Step - C -- unlink \
+ * \
+ * @prev is stable because it's still waiting for a new \
+ * @prev->next pointer, @next is stable because our \
+ * @node->next pointer is NULL and it will wait in Step-A. \
+ */ \
+ WRITE_ONCE(next->prev, prev); \
+ WRITE_ONCE(prev->next, next); \
+ \
+ return false; \
}
-void osq_unlock(struct optimistic_spin_queue *lock)
-{
- struct optimistic_spin_node *node, *next;
- int curr = encode_cpu(smp_processor_id());
-
- /*
- * Fast path for the uncontended case.
- */
- if (likely(atomic_cmpxchg_release(&lock->tail, curr,
- OSQ_UNLOCKED_VAL) == curr))
- return;
-
- /*
- * Second most likely case.
- */
- node = this_cpu_ptr(&osq_node);
- next = xchg(&node->next, NULL);
- if (next) {
- WRITE_ONCE(next->locked, 1);
- return;
- }
-
- next = osq_wait_next(lock, node, NULL);
- if (next)
- WRITE_ONCE(next->locked, 1);
+OSQ_LOCK(, smp_acquire__after_ctrl_dep())
+OSQ_LOCK(_relaxed, )
+
+#define OSQ_UNLOCK(EXT, FENCE, FENCECB) \
+void osq_unlock##EXT(struct optimistic_spin_queue *lock) \
+{ \
+ struct optimistic_spin_node *node, *next; \
+ int curr = encode_cpu(smp_processor_id()); \
+ \
+ /* \
+ * Fast path for the uncontended case. \
+ */ \
+ if (likely(atomic_cmpxchg_##FENCE(&lock->tail, curr, \
+ OSQ_UNLOCKED_VAL) == curr)) \
+ return; \
+ \
+ /* \
+ * Second most likely case. \
+ */ \
+ node = this_cpu_ptr(&osq_node); \
+ next = xchg(&node->next, NULL); \
+ if (next) \
+ goto done_setlocked; \
+ \
+ next = osq_wait_next(lock, node, NULL); \
+ if (!next) { \
+ FENCECB; \
+ return; \
+ } \
+ \
+done_setlocked: \
+ set_node_locked_##FENCE(next); \
}
+
+OSQ_UNLOCK(, release, smp_release__after_ctrl_dep())
+OSQ_UNLOCK(_relaxed, relaxed, )
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4bb2366..88e95b114392 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -389,7 +389,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
if (!rwsem_can_spin_on_owner(sem))
goto done;
- if (!osq_lock(&sem->osq))
+ if (!osq_lock_relaxed(&sem->osq))
goto done;
/*
@@ -425,7 +425,7 @@ static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
*/
cpu_relax_lowlatency();
}
- osq_unlock(&sem->osq);
+ osq_unlock_relaxed(&sem->osq);
done:
preempt_enable();
return taken;
--
2.6.6
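
For reference, the OSQ_UNLOCK() macro above hands the lock to the next node
through a set_node_locked_##FENCE() helper; the helpers themselves are defined
in an earlier hunk of this patch that is not shown here. A minimal sketch of
what they are expected to look like, inferred from the fence names used in the
two expansions (an assumption, not a quote from the patch):

static inline void set_node_locked_release(struct optimistic_spin_node *node)
{
        /* RELEASE: order the critical section before handing off the lock */
        smp_store_release(&node->locked, 1);
}

static inline void set_node_locked_relaxed(struct optimistic_spin_node *node)
{
        /* No ordering guarantee; callers must provide their own barriers */
        WRITE_ONCE(node->locked, 1);
}

With these, the two OSQ_LOCK()/OSQ_UNLOCK() invocations generate
osq_lock()/osq_unlock() with acquire/release semantics and
osq_lock_relaxed()/osq_unlock_relaxed() without, and the rwsem-xadd.c hunk
above switches the rwsem optimistic spinner over to the relaxed flavor.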
* Re: [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP
2016-10-09 15:17 ` Christoph Hellwig
@ 2016-10-10 6:07 ` Dave Chinner
2016-10-10 9:34 ` Christoph Hellwig
0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2016-10-10 6:07 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Davidlohr Bueso, Waiman Long, Peter Zijlstra, Ingo Molnar,
linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Jason Low, Jonathan Corbet, Scott J Norton,
Douglas Hatch
On Sun, Oct 09, 2016 at 08:17:48AM -0700, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 08:47:51AM +1100, Dave Chinner wrote:
> > Except that it's DAX, and in 4.7-rc1 that used shared locking at the
> > XFS level and never took exclusive locks.
> >
> > *However*, the DAX IO path locking in XFS has changed in 4.9-rc1 to
> > match the buffered IO single writer POSIX semantics - the test is a
> > bad test based on the fact it exercised a path that is under heavy
> > development and so can't be used as a regression test across
> > multiple kernels.
>
> That being said - I wonder if we should allow the shared lock on DAX
> files IFF the user is specifying O_DIRECT in the open mode..
It should do - if it doesn't then we screwed up the IO path
selection logic in XFS and we'll need to fix it.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP
2016-10-10 6:07 ` Dave Chinner
@ 2016-10-10 9:34 ` Christoph Hellwig
2016-10-11 21:06 ` Dave Chinner
0 siblings, 1 reply; 33+ messages in thread
From: Christoph Hellwig @ 2016-10-10 9:34 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Davidlohr Bueso, Waiman Long, Peter Zijlstra,
Ingo Molnar, linux-kernel, x86, linux-alpha, linux-ia64,
linux-s390, linux-arch, linux-doc, Jason Low, Jonathan Corbet,
Scott J Norton, Douglas Hatch
On Mon, Oct 10, 2016 at 05:07:45PM +1100, Dave Chinner wrote:
> > > *However*, the DAX IO path locking in XFS has changed in 4.9-rc1 to
> > > match the buffered IO single writer POSIX semantics - the test is a
> > > bad test based on the fact it exercised a path that is under heavy
> > > development and so can't be used as a regression test across
> > > multiple kernels.
> >
> > That being said - I wonder if we should allow the shared lock on DAX
> > files IFF the user is specifying O_DIRECT in the open mode..
>
> It should do - if it doesn't then we screwed up the IO path
> selection logic in XFS and we'll need to fix it.
Depends on your definition of "we". The DAX code has always abused the
direct I/O path, and that abuse is ingrained in the VFS path in a way that
we can't easily undo it in XFS, e.g. take a look at io_is_direct and
iocb_flags in include/linux/fs.h, which ensure that DAX I/O will always
appear as IOCB_DIRECT to the fs. It will take some time to untangle
this, but it's on my todo list once the last file system (ext4)
untangles the DAX and direct I/O path.
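
For reference, the two helpers being referred to look roughly like this in a
4.8-era include/linux/fs.h (a from-memory sketch, abridged -- the sync-related
flags are omitted):

static inline bool io_is_direct(struct file *filp)
{
        return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
}

static inline int iocb_flags(struct file *file)
{
        int res = 0;

        if (file->f_flags & O_APPEND)
                res |= IOCB_APPEND;
        if (io_is_direct(file))
                res |= IOCB_DIRECT;
        return res;
}

so any read or write on a DAX inode is tagged IOCB_DIRECT before the
filesystem ever sees it, regardless of whether O_DIRECT was in the open flags.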
* Re: [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP
2016-10-10 9:34 ` Christoph Hellwig
@ 2016-10-11 21:06 ` Dave Chinner
2016-10-16 5:57 ` Christoph Hellwig
0 siblings, 1 reply; 33+ messages in thread
From: Dave Chinner @ 2016-10-11 21:06 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Davidlohr Bueso, Waiman Long, Peter Zijlstra, Ingo Molnar,
linux-kernel, x86, linux-alpha, linux-ia64, linux-s390,
linux-arch, linux-doc, Jason Low, Jonathan Corbet, Scott J Norton,
Douglas Hatch
On Mon, Oct 10, 2016 at 02:34:34AM -0700, Christoph Hellwig wrote:
> On Mon, Oct 10, 2016 at 05:07:45PM +1100, Dave Chinner wrote:
> > > > *However*, the DAX IO path locking in XFS has changed in 4.9-rc1 to
> > > > match the buffered IO single writer POSIX semantics - the test is a
> > > > bad test based on the fact it exercised a path that is under heavy
> > > > development and so can't be used as a regression test across
> > > > multiple kernels.
> > >
> > > That being said - I wonder if we should allow the shared lock on DAX
> > > files IFF the user is specifying O_DIRECT in the open mode..
> >
> > It should do - if it doesn't then we screwed up the IO path
> > selection logic in XFS and we'll need to fix it.
>
> Depends on your definition of "we". The DAX code has always abused the
> direct I/O path, and that abuse is ingrained in the VFS path in a way that
> we can't easily undo it in XFS, e.g. take a look at io_is_direct and
> iocb_flags in include/linux/fs.h, which ensure that DAX I/O will always
> appear as IOCB_DIRECT to the fs.
Um, I seem to have completely missed that change - when did that
happen and why?
Oh, it was part of the misguided "enable DAX on block devices"
changes - it was supposed to fix a problem with block device access
using buffered IO instead of DAX (commit 65f87ee71852 "fs, block:
force direct-I/O for dax-enabled block devices"). The original
block device DAX code was reverted soon after this because it was
badly flawed, but this change was not removed at the same time
(probably because it was forgotten about).
Hence I'd suggest that DAX check in io_is_direct() should be removed
ASAP; the filesystems don't need it as they check the inode DAX
state directly, and the code it "fixed" is no longer in the tree.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP
2016-10-11 21:06 ` Dave Chinner
@ 2016-10-16 5:57 ` Christoph Hellwig
0 siblings, 0 replies; 33+ messages in thread
From: Christoph Hellwig @ 2016-10-16 5:57 UTC (permalink / raw)
To: Dave Chinner
Cc: Christoph Hellwig, Davidlohr Bueso, Waiman Long, Peter Zijlstra,
Ingo Molnar, linux-kernel, x86, linux-alpha, linux-ia64,
linux-s390, linux-arch, linux-doc, Jason Low, Jonathan Corbet,
Scott J Norton, Douglas Hatch
On Wed, Oct 12, 2016 at 08:06:40AM +1100, Dave Chinner wrote:
> Um, I seem to have completely missed that change - when did that
> happen and why?
>
> Oh, it was part of the misguided "enable DAX on block devices"
> changes -
No, that commit just switched it to use ->f_mapping:
- return (filp->f_flags & O_DIRECT) || IS_DAX(file_inode(filp));
+ return (filp->f_flags & O_DIRECT) || IS_DAX(filp->f_mapping->host);
The original version of it goes all the way back to introducing the
current-day DAX code in d475c6346 ("dax,ext2: replace XIP read and write
with DAX I/O");
> Hence I'd suggest that DAX check in io_is_direct() should be removed
> ASAP; the filesystems don't need it as they check the inode DAX
> state directly, and the code it "fixed" is no longer in the tree.
As long as ext4 still uses the overloaded direct_IO we need the
checks for DAX in the filemap.c generic read/write code. It seems
like that's only two spots anyway, but I'd feel much safer once ext4
is switched over to the iomap version of the dax code.
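
To make that constraint explicit (a hypothetical helper, named here purely
for illustration -- the real checks are open-coded in the filemap.c read and
write paths):

static inline bool dax_stays_on_direct_path(struct kiocb *iocb,
                                            struct inode *inode)
{
        /*
         * iocb_flags() tags DAX I/O as IOCB_DIRECT, and the generic code
         * must never fall back to buffered I/O for a DAX inode, so the
         * IS_DAX() checks in filemap.c have to stay until ext4 stops
         * overloading ->direct_IO for DAX.
         */
        return (iocb->ki_flags & IOCB_DIRECT) && IS_DAX(inode);
}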
end of thread, other threads:[~2016-10-16 5:57 UTC | newest]
Thread overview: 33+ messages
2016-08-18 21:11 [RFC PATCH-tip v4 00/10] locking/rwsem: Enable reader optimistic spinning Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier Waiman Long
2016-10-04 19:06 ` Davidlohr Bueso
2016-10-04 21:28 ` Jason Low
2016-10-05 12:19 ` Waiman Long
2016-10-05 15:11 ` Waiman Long
2016-10-06 5:47 ` Davidlohr Bueso
2016-10-06 19:30 ` Waiman Long
2016-10-10 5:39 ` [PATCH] locking/osq: Provide proper lock/unlock and relaxed flavors Davidlohr Bueso
2016-10-06 19:31 ` [RFC PATCH-tip v4 01/10] locking/osq: Make lock/unlock proper acquire/release barrier Jason Low
2016-08-18 21:11 ` [RFC PATCH-tip v4 02/10] locking/rwsem: Stop active read lock ASAP Waiman Long
2016-10-06 18:17 ` Davidlohr Bueso
2016-10-06 21:47 ` Dave Chinner
2016-10-06 22:51 ` Davidlohr Bueso
2016-10-07 21:45 ` Waiman Long
2016-10-09 15:17 ` Christoph Hellwig
2016-10-10 6:07 ` Dave Chinner
2016-10-10 9:34 ` Christoph Hellwig
2016-10-11 21:06 ` Dave Chinner
2016-10-16 5:57 ` Christoph Hellwig
2016-08-18 21:11 ` [RFC PATCH-tip v4 03/10] locking/rwsem: Make rwsem_spin_on_owner() return a tri-state value Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 04/10] locking/rwsem: Enable count-based spinning on reader Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 05/10] locking/rwsem: move down rwsem_down_read_failed function Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 06/10] locking/rwsem: Move common rwsem macros to asm-generic/rwsem_types.h Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 07/10] locking/rwsem: Change RWSEM_WAITING_BIAS for better disambiguation Waiman Long
2016-08-19 5:57 ` Wanpeng Li
2016-08-19 16:21 ` Waiman Long
2016-08-22 2:15 ` Wanpeng Li
2016-08-18 21:11 ` [RFC PATCH-tip v4 08/10] locking/rwsem: Enable spinning readers Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 09/10] locking/rwsem: Enable reactivation of reader spinning Waiman Long
2016-08-18 21:11 ` [RFC PATCH-tip v4 10/10] locking/rwsem: Add a boot parameter to reader spinning threshold Waiman Long
2016-08-24 4:00 ` Davidlohr Bueso
2016-08-24 18:39 ` Waiman Long