* [PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c
2023-12-29 20:51 [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code David Laight
@ 2023-12-29 20:53 ` David Laight
2023-12-30 1:59 ` Waiman Long
2023-12-29 20:54 ` [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path David Laight
` (5 subsequent siblings)
6 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2023-12-29 20:53 UTC (permalink / raw)
To: 'linux-kernel@vger.kernel.org',
'peterz@infradead.org', 'longman@redhat.com'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
struct optimistic_spin_node is private to the implementation.
Move it into the C file to ensure nothing is accessing it.
Signed-off-by: David Laight <david.laight@aculab.com>
---
include/linux/osq_lock.h | 5 -----
kernel/locking/osq_lock.c | 7 +++++++
2 files changed, 7 insertions(+), 5 deletions(-)
diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
index 5581dbd3bd34..ea8fb31379e3 100644
--- a/include/linux/osq_lock.h
+++ b/include/linux/osq_lock.h
@@ -6,11 +6,6 @@
* An MCS like lock especially tailored for optimistic spinning for sleeping
* lock implementations (mutex, rwsem, etc).
*/
-struct optimistic_spin_node {
- struct optimistic_spin_node *next, *prev;
- int locked; /* 1 if lock acquired */
- int cpu; /* encoded CPU # + 1 value */
-};
struct optimistic_spin_queue {
/*
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d5610ad52b92..d414eef4bec6 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -11,6 +11,13 @@
* called from interrupt context and we have preemption disabled while
* spinning.
*/
+
+struct optimistic_spin_node {
+ struct optimistic_spin_node *next, *prev;
+ int locked; /* 1 if lock acquired */
+ int cpu; /* encoded CPU # + 1 value */
+};
+
static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
/*
--
2.17.1
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c
2023-12-29 20:53 ` [PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c David Laight
@ 2023-12-30 1:59 ` Waiman Long
0 siblings, 0 replies; 30+ messages in thread
From: Waiman Long @ 2023-12-30 1:59 UTC (permalink / raw)
To: David Laight, 'linux-kernel@vger.kernel.org',
'peterz@infradead.org'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
On 12/29/23 15:53, David Laight wrote:
> struct optimistic_spin_node is private to the implementation.
> Move it into the C file to ensure nothing is accessing it.
>
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
> include/linux/osq_lock.h | 5 -----
> kernel/locking/osq_lock.c | 7 +++++++
> 2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/osq_lock.h b/include/linux/osq_lock.h
> index 5581dbd3bd34..ea8fb31379e3 100644
> --- a/include/linux/osq_lock.h
> +++ b/include/linux/osq_lock.h
> @@ -6,11 +6,6 @@
> * An MCS like lock especially tailored for optimistic spinning for sleeping
> * lock implementations (mutex, rwsem, etc).
> */
> -struct optimistic_spin_node {
> - struct optimistic_spin_node *next, *prev;
> - int locked; /* 1 if lock acquired */
> - int cpu; /* encoded CPU # + 1 value */
> -};
>
> struct optimistic_spin_queue {
> /*
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52b92..d414eef4bec6 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -11,6 +11,13 @@
> * called from interrupt context and we have preemption disabled while
> * spinning.
> */
> +
> +struct optimistic_spin_node {
> + struct optimistic_spin_node *next, *prev;
> + int locked; /* 1 if lock acquired */
> + int cpu; /* encoded CPU # + 1 value */
> +};
> +
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
>
> /*
Please correct the patch title "osf_lock.c" => "osq_lock.c".
After the fix, you can add
Acked-by: Waiman Long <longman@redhat.com>
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
2023-12-29 20:51 [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code David Laight
2023-12-29 20:53 ` [PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c David Laight
@ 2023-12-29 20:54 ` David Laight
2023-12-29 20:56 ` [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next() David Laight
` (4 subsequent siblings)
6 siblings, 0 replies; 30+ messages in thread
From: David Laight @ 2023-12-29 20:54 UTC (permalink / raw)
To: 'linux-kernel@vger.kernel.org',
'peterz@infradead.org', 'longman@redhat.com'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
osq_lock() starts by setting node->next to NULL and node->locked to 0.
Careful analysis shows that node->next is always NULL on entry.
node->locked is set non-zero by another cpu to force a wakeup.
This can only happen after the 'prev->next = node' assignment,
so locked can be set to zero just before that (along with the assignment
to node->prev).
Only initialise node->cpu once; after that, use its value instead
of smp_processor_id(), which is probably a real function call.
This should reduce cache-line bouncing a little.
Signed-off-by: David Laight <david.laight@aculab.com>
---
kernel/locking/osq_lock.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d414eef4bec6..55f5db896c02 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
struct optimistic_spin_node *prev)
{
struct optimistic_spin_node *next = NULL;
- int curr = encode_cpu(smp_processor_id());
+ int curr = node->cpu;
int old;
/*
@@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
- int curr = encode_cpu(smp_processor_id());
int old;
- node->locked = 0;
- node->next = NULL;
- node->cpu = curr;
+ if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
+ node->cpu = encode_cpu(smp_processor_id());
/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* the node fields we just initialised) semantics when updating
* the lock tail.
*/
- old = atomic_xchg(&lock->tail, curr);
+ old = atomic_xchg(&lock->tail, node->cpu);
if (old == OSQ_UNLOCKED_VAL)
return true;
prev = decode_cpu(old);
node->prev = prev;
+ node->locked = 0;
/*
* osq_lock() unqueue
@@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
void osq_unlock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node, *next;
- int curr = encode_cpu(smp_processor_id());
+ int curr = raw_cpu_read(osq_node.cpu);
/*
* Fast path for the uncontended case.
--
2.17.1
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply related [flat|nested] 30+ messages in thread
* [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()
2023-12-29 20:51 [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code David Laight
2023-12-29 20:53 ` [PATCH next 1/5] locking/osq_lock: Move the definition of optimistic_spin_node into osf_lock.c David Laight
2023-12-29 20:54 ` [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path David Laight
@ 2023-12-29 20:56 ` David Laight
2023-12-29 22:54 ` Linus Torvalds
2023-12-30 2:54 ` Waiman Long
2023-12-29 20:57 ` [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses David Laight
` (3 subsequent siblings)
6 siblings, 2 replies; 30+ messages in thread
From: David Laight @ 2023-12-29 20:56 UTC (permalink / raw)
To: 'linux-kernel@vger.kernel.org',
'peterz@infradead.org', 'longman@redhat.com'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
but only needs the 'cpu' value to write to lock->tail.
Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
Also directly return NULL or 'next' instead of breaking the loop.
Should have no effect on the generated code since gcc manages to
assume that 'prev != NULL' due to an earlier dereference.
Signed-off-by: David Laight <david.laight@aculab.com>
---
kernel/locking/osq_lock.c | 23 ++++++++++-------------
1 file changed, 10 insertions(+), 13 deletions(-)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 55f5db896c02..9bb3a077ba92 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -48,18 +48,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
static inline struct optimistic_spin_node *
osq_wait_next(struct optimistic_spin_queue *lock,
struct optimistic_spin_node *node,
- struct optimistic_spin_node *prev)
+ int old)
{
- struct optimistic_spin_node *next = NULL;
+ struct optimistic_spin_node *next;
int curr = node->cpu;
- int old;
/*
- * If there is a prev node in queue, then the 'old' value will be
- * the prev node's CPU #, else it's set to OSQ_UNLOCKED_VAL since if
- * we're currently last in queue, then the queue will then become empty.
+ * If osq_lock() is being cancelled there must be a previous node
+ * and 'old' is its CPU #.
+ * For osq_unlock() there is never a previous node and old is set
+ * to OSQ_UNLOCKED_VAL.
*/
- old = prev ? prev->cpu : OSQ_UNLOCKED_VAL;
for (;;) {
if (atomic_read(&lock->tail) == curr &&
@@ -69,7 +68,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
* will now observe @lock and will complete its
* unlock()/unqueue().
*/
- break;
+ return NULL;
}
/*
@@ -85,13 +84,11 @@ osq_wait_next(struct optimistic_spin_queue *lock,
if (node->next) {
next = xchg(&node->next, NULL);
if (next)
- break;
+ return next;
}
cpu_relax();
}
-
- return next;
}
bool osq_lock(struct optimistic_spin_queue *lock)
@@ -192,7 +189,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* back to @prev.
*/
- next = osq_wait_next(lock, node, prev);
+ next = osq_wait_next(lock, node, prev->cpu);
if (!next)
return false;
@@ -232,7 +229,7 @@ void osq_unlock(struct optimistic_spin_queue *lock)
return;
}
- next = osq_wait_next(lock, node, NULL);
+ next = osq_wait_next(lock, node, OSQ_UNLOCKED_VAL);
if (next)
WRITE_ONCE(next->locked, 1);
}
--
2.17.1
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()
2023-12-29 20:56 ` [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next() David Laight
@ 2023-12-29 22:54 ` Linus Torvalds
2023-12-30 2:54 ` Waiman Long
1 sibling, 0 replies; 30+ messages in thread
From: Linus Torvalds @ 2023-12-29 22:54 UTC (permalink / raw)
To: David Laight
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
longman@redhat.com, mingo@redhat.com, will@kernel.org,
boqun.feng@gmail.com, xinhui.pan@linux.vnet.ibm.com,
virtualization@lists.linux-foundation.org, Zeng Heng
On Fri, 29 Dec 2023 at 12:56, David Laight <David.Laight@aculab.com> wrote:
>
> osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
> but only needs the 'cpu' value to write to lock->tail.
> Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
>
> Also directly return NULL or 'next' instead of breaking the loop.
Please split these two totally independent things out of the patch,
just to make things much more obvious.
I like the new calling convention, but I don't like how the patch
isn't obviously just that.
In fact, I'd take your patch #1 and just the calling convention change
from #3 as "these are obviously not changing anything at all, only
moving things to more local places".
I'd also take the other part of #3 as a "clearly doesn't change
anything" but it should be a separate patch, and it should be done
differently: make 'next' be local to just *inside* the for-loop (in
fact, make it local to the if-statement that sets it), to clarify the
whole thing that it can never be non-NULL at the top of the loop, and
can never have any long-term semantics.
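A rough sketch of that second, purely mechanical change (assuming the
osq_wait_next() body as it stands after patch 3 above; illustrative
only, not part of the posted series) might look like:

static inline struct optimistic_spin_node *
osq_wait_next(struct optimistic_spin_queue *lock,
              struct optimistic_spin_node *node,
              int old)
{
        int curr = node->cpu;

        for (;;) {
                if (atomic_read(&lock->tail) == curr &&
                    atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr) {
                        /* We were last in the queue; @prev will observe @lock. */
                        return NULL;
                }

                if (node->next) {
                        /* 'next' only exists in the iteration that claims it */
                        struct optimistic_spin_node *next = xchg(&node->next, NULL);

                        if (next)
                                return next;
                }

                cpu_relax();
        }
}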
The other parts actually change some logic, and would need the OSQ
people to take a more serious look.
Linus
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()
2023-12-29 20:56 ` [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next() David Laight
2023-12-29 22:54 ` Linus Torvalds
@ 2023-12-30 2:54 ` Waiman Long
1 sibling, 0 replies; 30+ messages in thread
From: Waiman Long @ 2023-12-30 2:54 UTC (permalink / raw)
To: David Laight, 'linux-kernel@vger.kernel.org',
'peterz@infradead.org'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
On 12/29/23 15:56, David Laight wrote:
> osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
> but only needs the 'cpu' value to write to lock->tail.
> Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
>
> Also directly return NULL or 'next' instead of breaking the loop.
>
> Should have no effect on the generated code since gcc manages to
> assume that 'prev != NULL' due to an earlier dereference.
>
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
> kernel/locking/osq_lock.c | 23 ++++++++++-------------
> 1 file changed, 10 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 55f5db896c02..9bb3a077ba92 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -48,18 +48,17 @@ static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> static inline struct optimistic_spin_node *
> osq_wait_next(struct optimistic_spin_queue *lock,
> struct optimistic_spin_node *node,
> - struct optimistic_spin_node *prev)
> + int old)
Make the last argument name more descriptive, like "old_cpu" as the
"int" type does not provide enough context to allow people to guess what
"old" may be.
Cheers,
Longman
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
2023-12-29 20:51 [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code David Laight
` (2 preceding siblings ...)
2023-12-29 20:56 ` [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next() David Laight
@ 2023-12-29 20:57 ` David Laight
2023-12-30 3:08 ` Waiman Long
` (2 more replies)
2023-12-29 20:58 ` [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check David Laight
` (2 subsequent siblings)
6 siblings, 3 replies; 30+ messages in thread
From: David Laight @ 2023-12-29 20:57 UTC (permalink / raw)
To: 'linux-kernel@vger.kernel.org',
'peterz@infradead.org', 'longman@redhat.com'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
this_cpu_ptr() is rather more expensive than raw_cpu_read() since
the latter can use an 'offset from register' (%gs for x86-64).
Add a 'self' field to 'struct optimistic_spin_node' that can be
read with raw_cpu_read(), initialise on first call.
Signed-off-by: David Laight <david.laight@aculab.com>
---
kernel/locking/osq_lock.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 9bb3a077ba92..b60b0add0161 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -13,7 +13,7 @@
*/
struct optimistic_spin_node {
- struct optimistic_spin_node *next, *prev;
+ struct optimistic_spin_node *self, *next, *prev;
int locked; /* 1 if lock acquired */
int cpu; /* encoded CPU # + 1 value */
};
@@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
bool osq_lock(struct optimistic_spin_queue *lock)
{
- struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+ struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
struct optimistic_spin_node *prev, *next;
int old;
- if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
- node->cpu = encode_cpu(smp_processor_id());
+ if (unlikely(!node)) {
+ int cpu = encode_cpu(smp_processor_id());
+ node = decode_cpu(cpu);
+ node->self = node;
+ node->cpu = cpu;
+ }
/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -222,7 +226,7 @@ void osq_unlock(struct optimistic_spin_queue *lock)
/*
* Second most likely case.
*/
- node = this_cpu_ptr(&osq_node);
+ node = raw_cpu_read(osq_node.self);
next = xchg(&node->next, NULL);
if (next) {
WRITE_ONCE(next->locked, 1);
--
2.17.1
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
2023-12-29 20:57 ` [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses David Laight
@ 2023-12-30 3:08 ` Waiman Long
2023-12-30 11:09 ` Ingo Molnar
2023-12-30 20:41 ` Linus Torvalds
[not found] ` <ZZB/jIvKgKQ2sV7M@gmail.com>
2 siblings, 1 reply; 30+ messages in thread
From: Waiman Long @ 2023-12-30 3:08 UTC (permalink / raw)
To: David Laight, 'linux-kernel@vger.kernel.org',
'peterz@infradead.org'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
On 12/29/23 15:57, David Laight wrote:
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.
>
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
> kernel/locking/osq_lock.c | 14 +++++++++-----
> 1 file changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index 9bb3a077ba92..b60b0add0161 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -13,7 +13,7 @@
> */
>
> struct optimistic_spin_node {
> - struct optimistic_spin_node *next, *prev;
> + struct optimistic_spin_node *self, *next, *prev;
> int locked; /* 1 if lock acquired */
> int cpu; /* encoded CPU # + 1 value */
> };
> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>
> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
My gcc 11 compiler produces the following x86-64 code:
92 struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
0x0000000000000029 <+25>: mov %rcx,%rdx
0x000000000000002c <+28>: add %gs:0x0(%rip),%rdx # 0x34
<osq_lock+36>
Which looks pretty optimized to me. Maybe an older compiler would generate
more complex code. However, I do have some doubt as to the benefit of
this patch at the expense of making the code a bit more complex.
Cheers,
Longman
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
2023-12-30 3:08 ` Waiman Long
@ 2023-12-30 11:09 ` Ingo Molnar
2023-12-30 11:35 ` David Laight
0 siblings, 1 reply; 30+ messages in thread
From: Ingo Molnar @ 2023-12-30 11:09 UTC (permalink / raw)
To: Waiman Long
Cc: David Laight, 'linux-kernel@vger.kernel.org',
'peterz@infradead.org', 'mingo@redhat.com',
'will@kernel.org', 'boqun.feng@gmail.com',
'Linus Torvalds', 'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
* Waiman Long <longman@redhat.com> wrote:
> On 12/29/23 15:57, David Laight wrote:
> > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > the latter can use an 'offset from register' (%gs for x86-64).
> >
> > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > read with raw_cpu_read(), initialise on first call.
> >
> > Signed-off-by: David Laight <david.laight@aculab.com>
> > ---
> > kernel/locking/osq_lock.c | 14 +++++++++-----
> > 1 file changed, 9 insertions(+), 5 deletions(-)
> >
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index 9bb3a077ba92..b60b0add0161 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -13,7 +13,7 @@
> > */
> > struct optimistic_spin_node {
> > - struct optimistic_spin_node *next, *prev;
> > + struct optimistic_spin_node *self, *next, *prev;
> > int locked; /* 1 if lock acquired */
> > int cpu; /* encoded CPU # + 1 value */
> > };
> > @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> > bool osq_lock(struct optimistic_spin_queue *lock)
> > {
> > - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>
> My gcc 11 compiler produces the following x86-64 code:
>
> 92 struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> 0x0000000000000029 <+25>: mov %rcx,%rdx
> 0x000000000000002c <+28>: add %gs:0x0(%rip),%rdx # 0x34
> <osq_lock+36>
>
> Which looks pretty optimized to me. Maybe an older compiler would generate
> more complex code. However, I do have some doubt as to the benefit of this patch
> at the expense of making the code a bit more complex.
GCC-11 is plenty of a look-back window in terms of compiler efficiency:
latest enterprise distros use GCC-11 or newer, while recent desktop
distros use GCC-13. Anything older won't matter, because no major
distribution is going to use new kernels with old compilers.
Thanks,
Ingo
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
2023-12-30 11:09 ` Ingo Molnar
@ 2023-12-30 11:35 ` David Laight
2023-12-31 3:04 ` Waiman Long
0 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2023-12-30 11:35 UTC (permalink / raw)
To: 'Ingo Molnar', Waiman Long
Cc: 'linux-kernel@vger.kernel.org',
'peterz@infradead.org', 'mingo@redhat.com',
'will@kernel.org', 'boqun.feng@gmail.com',
'Linus Torvalds', 'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
From: Ingo Molnar
> Sent: 30 December 2023 11:09
>
>
> * Waiman Long <longman@redhat.com> wrote:
>
> > On 12/29/23 15:57, David Laight wrote:
> > > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > > the latter can use an 'offset from register' (%gs for x86-64).
> > >
> > > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > > read with raw_cpu_read(), initialise on first call.
> > >
> > > Signed-off-by: David Laight <david.laight@aculab.com>
> > > ---
> > > kernel/locking/osq_lock.c | 14 +++++++++-----
> > > 1 file changed, 9 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > > index 9bb3a077ba92..b60b0add0161 100644
> > > --- a/kernel/locking/osq_lock.c
> > > +++ b/kernel/locking/osq_lock.c
> > > @@ -13,7 +13,7 @@
> > > */
> > > struct optimistic_spin_node {
> > > - struct optimistic_spin_node *next, *prev;
> > > + struct optimistic_spin_node *self, *next, *prev;
> > > int locked; /* 1 if lock acquired */
> > > int cpu; /* encoded CPU # + 1 value */
> > > };
> > > @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> > > bool osq_lock(struct optimistic_spin_queue *lock)
> > > {
> > > - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > > + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> >
> > My gcc 11 compiler produces the following x86-64 code:
> >
> > 92 struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > 0x0000000000000029 <+25>: mov %rcx,%rdx
> > 0x000000000000002c <+28>: add %gs:0x0(%rip),%rdx # 0x34
> > <osq_lock+36>
> >
> > Which looks pretty optimized to me. Maybe an older compiler would generate
> > more complex code. However, I do have some doubt as to the benefit of this patch
> > at the expense of making the code a bit more complex.
My changed code is one instruction shorter!
18: 65 48 8b 15 00 00 00 mov %gs:0x0(%rip),%rdx # 20 <osq_lock+0x20>
1f: 00
1c: R_X86_64_PC32 .data..percpu..shared_aligned-0x4
However it might have one less cache line miss.
> GCC-11 is plenty of a look-back window in terms of compiler efficiency:
> latest enterprise distros use GCC-11 or newer, while recent desktop
> distros use GCC-13. Anything older won't matter, because no major
> distribution is going to use new kernels with old compilers.
There must be a difference in the header files as well.
Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
But maybe based on some config option.
I'm seeing this_cpu_ptr(&xxx) converted to per_cpu_ptr(&xxx, smp_processor_id())
which necessitates an array lookup (indexed by cpu number).
Whereas I think you are seeing it implemented as
raw_cpu_read(per_cpu_data_base) + offset_to(xxx)
So the old code generates (after the prologue):
10: 49 89 fd mov %rdi,%r13
13: 49 c7 c4 00 00 00 00 mov $0x0,%r12
16: R_X86_64_32S .data..percpu..shared_aligned
1a: e8 00 00 00 00 callq 1f <osq_lock+0x1f>
1b: R_X86_64_PC32 debug_smp_processor_id-0x4
1f: 89 c0 mov %eax,%eax
21: 48 8b 1c c5 00 00 00 mov 0x0(,%rax,8),%rbx
28: 00
25: R_X86_64_32S __per_cpu_offset
29: e8 00 00 00 00 callq 2e <osq_lock+0x2e>
2a: R_X86_64_PC32 debug_smp_processor_id-0x4
2e: 4c 01 e3 add %r12,%rbx
31: 83 c0 01 add $0x1,%eax
34: c7 43 10 00 00 00 00 movl $0x0,0x10(%rbx)
3b: 48 c7 03 00 00 00 00 movq $0x0,(%rbx)
42: 89 43 14 mov %eax,0x14(%rbx)
45: 41 87 45 00 xchg %eax,0x0(%r13)
I was also surprised that smp_processor_id() is a real function rather
than an offset from %gs.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
2023-12-30 11:35 ` David Laight
@ 2023-12-31 3:04 ` Waiman Long
2023-12-31 10:36 ` David Laight
0 siblings, 1 reply; 30+ messages in thread
From: Waiman Long @ 2023-12-31 3:04 UTC (permalink / raw)
To: David Laight, 'Ingo Molnar'
Cc: 'linux-kernel@vger.kernel.org',
'peterz@infradead.org', 'mingo@redhat.com',
'will@kernel.org', 'boqun.feng@gmail.com',
'Linus Torvalds', 'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
On 12/30/23 06:35, David Laight wrote:
> From: Ingo Molnar
>> Sent: 30 December 2023 11:09
>>
>>
>> * Waiman Long <longman@redhat.com> wrote:
>>
>>> On 12/29/23 15:57, David Laight wrote:
>>>> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
>>>> the latter can use an 'offset from register' (%gs for x86-64).
>>>>
>>>> Add a 'self' field to 'struct optimistic_spin_node' that can be
>>>> read with raw_cpu_read(), initialise on first call.
>>>>
>>>> Signed-off-by: David Laight <david.laight@aculab.com>
>>>> ---
>>>> kernel/locking/osq_lock.c | 14 +++++++++-----
>>>> 1 file changed, 9 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>>>> index 9bb3a077ba92..b60b0add0161 100644
>>>> --- a/kernel/locking/osq_lock.c
>>>> +++ b/kernel/locking/osq_lock.c
>>>> @@ -13,7 +13,7 @@
>>>> */
>>>> struct optimistic_spin_node {
>>>> - struct optimistic_spin_node *next, *prev;
>>>> + struct optimistic_spin_node *self, *next, *prev;
>>>> int locked; /* 1 if lock acquired */
>>>> int cpu; /* encoded CPU # + 1 value */
>>>> };
>>>> @@ -93,12 +93,16 @@ osq_wait_next(struct optimistic_spin_queue *lock,
>>>> bool osq_lock(struct optimistic_spin_queue *lock)
>>>> {
>>>> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>>> + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>>> My gcc 11 compiler produces the following x86-64 code:
>>>
>>> 92 struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
>>> 0x0000000000000029 <+25>: mov %rcx,%rdx
>>> 0x000000000000002c <+28>: add %gs:0x0(%rip),%rdx # 0x34
>>> <osq_lock+36>
>>>
>>> Which looks pretty optimized to me. Maybe an older compiler would generate
>>> more complex code. However, I do have some doubt as to the benefit of this patch
>>> at the expense of making the code a bit more complex.
> My changed code is one instruction shorter!
> 18: 65 48 8b 15 00 00 00 mov %gs:0x0(%rip),%rdx # 20 <osq_lock+0x20>
> 1f: 00
> 1c: R_X86_64_PC32 .data..percpu..shared_aligned-0x4
> However it might have one less cache line miss.
>
>> GCC-11 is plenty of a look-back window in terms of compiler efficiency:
>> latest enterprise distros use GCC-11 or newer, while recent desktop
>> distros use GCC-13. Anything older won't matter, because no major
>> distribution is going to use new kernels with old compilers.
> There must be a difference in the header files as well.
> Possibly forced by the older compiler I'm using (7.5 from Ubuntu 18.04).
> But maybe based on some config option.
>
> I'm seeing this_cpu_ptr(&xxx) converted to per_cpu_ptr(&xxx, smp_processor_id())
> which necessitates an array lookup (indexed by cpu number).
> Whereas I think you are seeing it implemented as
> raw_cpu_read(per_cpu_data_base) + offset_to(xxx)
>
> So the old code generates (after the prologue):
> 10: 49 89 fd mov %rdi,%r13
> 13: 49 c7 c4 00 00 00 00 mov $0x0,%r12
> 16: R_X86_64_32S .data..percpu..shared_aligned
> 1a: e8 00 00 00 00 callq 1f <osq_lock+0x1f>
> 1b: R_X86_64_PC32 debug_smp_processor_id-0x4
> 1f: 89 c0 mov %eax,%eax
> 21: 48 8b 1c c5 00 00 00 mov 0x0(,%rax,8),%rbx
> 28: 00
> 25: R_X86_64_32S __per_cpu_offset
> 29: e8 00 00 00 00 callq 2e <osq_lock+0x2e>
> 2a: R_X86_64_PC32 debug_smp_processor_id-0x4
> 2e: 4c 01 e3 add %r12,%rbx
> 31: 83 c0 01 add $0x1,%eax
> 34: c7 43 10 00 00 00 00 movl $0x0,0x10(%rbx)
> 3b: 48 c7 03 00 00 00 00 movq $0x0,(%rbx)
> 42: 89 43 14 mov %eax,0x14(%rbx)
> 45: 41 87 45 00 xchg %eax,0x0(%r13)
>
> I was also surprised that smp_processor_id() is a real function rather
> than an offset from %gs.
I have looked up definition of this_cpu_ptr() and gotten the following
results:
this_cpu_ptr() => raw_cpu_ptr() => arch_raw_cpu_ptr()
/*
* Compared to the generic __my_cpu_offset version, the following
* saves one instruction and avoids clobbering a temp register.
*/
#define arch_raw_cpu_ptr(ptr) \
({ \
unsigned long tcp_ptr__; \
asm ("add " __percpu_arg(1) ", %0" \
: "=r" (tcp_ptr__) \
: "m" (this_cpu_off), "0" (ptr)); \
(typeof(*(ptr)) __kernel __force *)tcp_ptr__; \
})
The presence of debug_smp_processor_id in your compiled code is likely
due to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.
#ifdef CONFIG_DEBUG_PREEMPT
extern unsigned int debug_smp_processor_id(void);
# define smp_processor_id() debug_smp_processor_id()
#else
# define smp_processor_id() __smp_processor_id()
#endif
I don't have that config entry in my kernel config and so I only get 2
instructions for this_cpu_ptr(). We are not going to optimize the code
specifically for CONFIG_DEBUG_PREEMPT and so this patch should be dropped.
Cheers,
Longman
^ permalink raw reply [flat|nested] 30+ messages in thread* RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
2023-12-31 3:04 ` Waiman Long
@ 2023-12-31 10:36 ` David Laight
0 siblings, 0 replies; 30+ messages in thread
From: David Laight @ 2023-12-31 10:36 UTC (permalink / raw)
To: 'Waiman Long', 'Ingo Molnar'
Cc: 'linux-kernel@vger.kernel.org',
'peterz@infradead.org', 'mingo@redhat.com',
'will@kernel.org', 'boqun.feng@gmail.com',
'Linus Torvalds', 'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
From: Waiman Long
> Sent: 31 December 2023 03:04
....
> The presence of debug_smp_processor_id in your compiled code is likely
> due to the setting of CONFIG_DEBUG_PREEMPT in your kernel config.
>
> #ifdef CONFIG_DEBUG_PREEMPT
> extern unsigned int debug_smp_processor_id(void);
> # define smp_processor_id() debug_smp_processor_id()
> #else
> # define smp_processor_id() __smp_processor_id()
> #endif
>
> I don't have that config entry in my kernel config and so I only get 2
> instructions for this_cpu_ptr(). We are not going to optimize the code
> specifically for CONFIG_DEBUG_PREEMPT and so this patch should be dropped.
Yes, I discovered that late last night.
I've no idea why it is set.
It might even be inherited from the ubuntu 18.04 (I think) .config
I started with (mostly removing extra modules to reduce compile
time and the size of uramdisk).
I'll look at the patches again.
Saving node->prev->cpu as node->prev_cpu will definitely save
the cache-line bounce.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
2023-12-29 20:57 ` [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses David Laight
2023-12-30 3:08 ` Waiman Long
@ 2023-12-30 20:41 ` Linus Torvalds
2023-12-30 20:59 ` Linus Torvalds
2023-12-31 11:41 ` David Laight
[not found] ` <ZZB/jIvKgKQ2sV7M@gmail.com>
2 siblings, 2 replies; 30+ messages in thread
From: Linus Torvalds @ 2023-12-30 20:41 UTC (permalink / raw)
To: David Laight
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
longman@redhat.com, mingo@redhat.com, will@kernel.org,
boqun.feng@gmail.com, xinhui.pan@linux.vnet.ibm.com,
virtualization@lists.linux-foundation.org, Zeng Heng
[-- Attachment #1: Type: text/plain, Size: 3129 bytes --]
On Fri, 29 Dec 2023 at 12:57, David Laight <David.Laight@aculab.com> wrote:
>
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-64).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.
No, this is horrible.
The problem isn't the "this_cpu_ptr()", it's the rest of the code.
> bool osq_lock(struct optimistic_spin_queue *lock)
> {
> - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
No. Both of these are crap.
> struct optimistic_spin_node *prev, *next;
> int old;
>
> - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> - node->cpu = encode_cpu(smp_processor_id());
> + if (unlikely(!node)) {
> + int cpu = encode_cpu(smp_processor_id());
> + node = decode_cpu(cpu);
> + node->self = node;
> + node->cpu = cpu;
> + }
The proper fix here is to not do that silly
node = this_cpu_ptr(&osq_node);
..
node->next = NULL;
dance at all, but to simply do
this_cpu_write(osq_node.next, NULL);
in the first place. That makes the whole thing just a single store off
the segment descriptor.
Yes, you'll eventually end up doing that
node = this_cpu_ptr(&osq_node);
thing because it then wants to use that raw pointer to do
WRITE_ONCE(prev->next, node);
but that's a separate issue and still does not make it worth it to
create a pointless self-pointer.
Btw, if you *really* want to solve that separate issue, then make the
optimistic_spin_node struct not contain the pointers at all, but the
CPU numbers, and then turn those numbers into the pointers the exact
same way it does for the "lock->tail" thing, ie doing that whole
prev = decode_cpu(old);
dance. That *may* then result in avoiding turning them into pointers
at all in some cases.
Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
be -1 instead, and add something like
#define IS_OSQ_UNLOCKED(x) ((int)(x)<0)
and that would then avoid the +1 / -1 games in encoding/decoding the
CPU numbers. It causes silly code generated like this:
subl $1, %eax #, cpu_nr
...
cltq
addq __per_cpu_offset(,%rax,8), %rcx
which seems honestly stupid. The cltq is there for sign-extension,
which is because all these things are "int", and the "subl" will
zero-extend to 64-bit, not sign-extend.
At that point, I think gcc might be able to just generate
addq __per_cpu_offset-8(,%rax,8), %rcx
but honestly, I think it would be nicer to just have decode_cpu() do
unsigned int cpu_nr = encoded_cpu_val;
return per_cpu_ptr(&osq_node, cpu_nr);
and not have the -1/+1 at all.
Hmm?
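Put together, that suggestion might look something like the following
sketch. This assumes OSQ_UNLOCKED_VAL really is changed to -1 (which the
posted patches do not do), and it then relies on osq_lock_init() rather
than zeroed memory to mark a lock as unlocked:

#define OSQ_UNLOCKED_VAL        (-1)
#define IS_OSQ_UNLOCKED(v)      ((int)(v) < 0)

static inline int encode_cpu(int cpu_nr)
{
        return cpu_nr;                          /* no +1 bias needed */
}

static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
{
        unsigned int cpu_nr = encoded_cpu_val;  /* unsigned: no sign extension */

        return per_cpu_ptr(&osq_node, cpu_nr);
}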
UNTESTED patch to just do the "this_cpu_write()" parts attached.
Again, note how we do end up doing that this_cpu_ptr conversion later
anyway, but at least it's off the critical path.
Linus
[-- Attachment #2: patch.diff --]
[-- Type: text/x-patch, Size: 1083 bytes --]
kernel/locking/osq_lock.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 75a6f6133866..c3a166b7900c 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -92,14 +92,14 @@ osq_wait_next(struct optimistic_spin_queue *lock,
bool osq_lock(struct optimistic_spin_queue *lock)
{
- struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
+ struct optimistic_spin_node *node;
struct optimistic_spin_node *prev, *next;
int curr = encode_cpu(smp_processor_id());
int old;
- node->locked = 0;
- node->next = NULL;
- node->cpu = curr;
+ this_cpu_write(osq_node.next, NULL);
+ this_cpu_write(osq_node.locked, 0);
+ this_cpu_write(osq_node.cpu, curr);
/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -112,7 +112,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
return true;
prev = decode_cpu(old);
- node->prev = prev;
+ this_cpu_write(osq_node.prev, prev);
+
+ node = this_cpu_ptr(&osq_node);
/*
* osq_lock() unqueue
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
2023-12-30 20:41 ` Linus Torvalds
@ 2023-12-30 20:59 ` Linus Torvalds
2023-12-31 11:56 ` David Laight
2023-12-31 11:41 ` David Laight
1 sibling, 1 reply; 30+ messages in thread
From: Linus Torvalds @ 2023-12-30 20:59 UTC (permalink / raw)
To: David Laight
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
longman@redhat.com, mingo@redhat.com, will@kernel.org,
boqun.feng@gmail.com, xinhui.pan@linux.vnet.ibm.com,
virtualization@lists.linux-foundation.org, Zeng Heng
On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> UNTESTED patch to just do the "this_cpu_write()" parts attached.
> Again, note how we do end up doing that this_cpu_ptr conversion later
> anyway, but at least it's off the critical path.
Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
code, it really is still better than caching a value in memory.
At least the memory location that 'this_cpu_ptr()' accesses is
slightly more likely to be hot (and is right next to the cpu number,
iirc).
That said, I think we should fix this_cpu_ptr() to not ever generate
that disgusting cltq just because the cpu pointer has the wrong
signedness. I don't quite know how to do it, but this:
-#define per_cpu_offset(x) (__per_cpu_offset[x])
+#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])
at least helps a *bit*. It gets rid of the cltq, at least, but if
somebody actually passes in an 'unsigned long' cpuid, it would cause
an unnecessary truncation.
And gcc still generates
subl $1, %eax #, cpu_nr
addq __per_cpu_offset(,%rax,8), %rcx
instead of just doing
addq __per_cpu_offset-8(,%rax,8), %rcx
because it still needs to clear the upper 32 bits and doesn't know
that the 'xchg()' already did that.
Oh well. I guess even without the -1/+1 games by the OSQ code, we
would still end up with a "movl" just to do that upper bits clearing
that the compiler doesn't know is unnecessary.
I don't think we have any reasonable way to tell the compiler that the
register output of our xchg() inline asm has the upper 32 bits clear.
Linus
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
2023-12-30 20:59 ` Linus Torvalds
@ 2023-12-31 11:56 ` David Laight
0 siblings, 0 replies; 30+ messages in thread
From: David Laight @ 2023-12-31 11:56 UTC (permalink / raw)
To: 'Linus Torvalds'
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
longman@redhat.com, mingo@redhat.com, will@kernel.org,
boqun.feng@gmail.com, xinhui.pan@linux.vnet.ibm.com,
virtualization@lists.linux-foundation.org, Zeng Heng
From: Linus Torvalds
> Sent: 30 December 2023 20:59
>
> On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > UNTESTED patch to just do the "this_cpu_write()" parts attached.
> > Again, note how we do end up doing that this_cpu_ptr conversion later
> > anyway, but at least it's off the critical path.
>
> Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
> code, it really is still better than caching a value in memory.
>
> At least the memory location that 'this_cpu_ptr()' accesses is
> slightly more likely to be hot (and is right next to the cpu number,
> iirc).
I was only going to access the 'self' field in code that required
the 'node' cache line be present.
>
> That said, I think we should fix this_cpu_ptr() to not ever generate
> that disgusting cltq just because the cpu pointer has the wrong
> signedness. I don't quite know how to do it, but this:
>
> -#define per_cpu_offset(x) (__per_cpu_offset[x])
> +#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])
>
> at least helps a *bit*. It gets rid of the cltq, at least, but if
> somebody actually passes in an 'unsigned long' cpuid, it would cause
> an unnecessary truncation.
Doing the conversion using arithmetic might help, so:
__per_cpu_offset[(x) + 0u]
> And gcc still generates
>
> subl $1, %eax #, cpu_nr
> addq __per_cpu_offset(,%rax,8), %rcx
>
> instead of just doing
>
> addq __per_cpu_offset-8(,%rax,8), %rcx
>
> because it still needs to clear the upper 32 bits and doesn't know
> that the 'xchg()' already did that.
Not only that, you need to do the 'subl' after converting to 64 bits.
Otherwise the wrong location is read were cpu_nr to be zero.
I've tried that - but it still failed.
> Oh well. I guess even without the -1/+1 games by the OSQ code, we
> would still end up with a "movl" just to do that upper bits clearing
> that the compiler doesn't know is unnecessary.
>
> I don't think we have any reasonable way to tell the compiler that the
> register output of our xchg() inline asm has the upper 32 bits clear.
It could be done for a 32bit unsigned xchg() - just make the return
type unsigned 64bit.
But that won't work for the signed exchange - and 'atomic_t' is signed.
OTOH I'd guess this code could use 'unsigned int' instead of atomic_t?
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
2023-12-30 20:41 ` Linus Torvalds
2023-12-30 20:59 ` Linus Torvalds
@ 2023-12-31 11:41 ` David Laight
1 sibling, 0 replies; 30+ messages in thread
From: David Laight @ 2023-12-31 11:41 UTC (permalink / raw)
To: 'Linus Torvalds'
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
longman@redhat.com, mingo@redhat.com, will@kernel.org,
boqun.feng@gmail.com, xinhui.pan@linux.vnet.ibm.com,
virtualization@lists.linux-foundation.org, Zeng Heng
From: Linus Torvalds
> Sent: 30 December 2023 20:41
>
> On Fri, 29 Dec 2023 at 12:57, David Laight <David.Laight@aculab.com> wrote:
> >
> > this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> > the latter can use an 'offset from register' (%gs for x86-64).
> >
> > Add a 'self' field to 'struct optimistic_spin_node' that can be
> > read with raw_cpu_read(), initialise on first call.
>
> No, this is horrible.
>
> The problem isn't the "this_cpu_ptr()", it's the rest of the code.
>
> > bool osq_lock(struct optimistic_spin_queue *lock)
> > {
> > - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
>
> No. Both of these are crap.
>
> > struct optimistic_spin_node *prev, *next;
> > int old;
> >
> > - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> > - node->cpu = encode_cpu(smp_processor_id());
> > + if (unlikely(!node)) {
> > + int cpu = encode_cpu(smp_processor_id());
> > + node = decode_cpu(cpu);
> > + node->self = node;
> > + node->cpu = cpu;
> > + }
>
> The proper fix here is to not do that silly
>
> node = this_cpu_ptr(&osq_node);
> ..
> node->next = NULL;
>
> dance at all, but to simply do
>
> this_cpu_write(osq_node.next, NULL);
>
> in the first place. That makes the whole thing just a single store off
> the segment descriptor.
Is the equivalent true (ie offset from fixed register) for all SMP archs?
Or do some have to do something rather more complicated?
> Yes, you'll eventually end up doing that
>
> node = this_cpu_ptr(&osq_node);
For some reason I've had CONFIG_DEBUG_PREEMPT set. I don't remember
setting it, and can't imagine why I might have.
Best guess is it was inherited from the ubuntu .config I started with.
In any case it changes smp_processor_id() into a real function
in order to check that preemption is disabled.
I'd guess something like BUG_ON(!raw_cpu_read(preempt_disable_count))
would be faster and more obvious!
> thing because it then wants to use that raw pointer to do
>
> WRITE_ONCE(prev->next, node);
>
> but that's a separate issue and still does not make it worth it to
> create a pointless self-pointer.
I could claim that loading it is one instruction shorter and that
if the cache line containing 'node' is needed and 'pcpu_hot'
is (unexpectedly) not cached it saves a cache line load.
But I probably won't!
>
> Btw, if you *really* want to solve that separate issue, then make the
> optimistic_spin_node struct not contain the pointers at all, but the
> CPU numbers, and then turn those numbers into the pointers the exact
> same way it does for the "lock->tail" thing, ie doing that whole
>
> prev = decode_cpu(old);
>
> dance. That *may* then result in avoiding turning them into pointers
> at all in some cases.
Don't think so.
Pretty much all the uses need to dereference the next/prev pointers.
> Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
> be -1 instead, and add something like
>
> #define IS_OSQ_UNLOCKED(x) ((int)(x)<0)
I did think about that (but not for these patches).
But it is a lot more dangerous because it absolutely requires
the structure be correctly initialised, not just be all zero.
That might show up some very strange bugs.
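For reference, the existing initialisers in include/linux/osq_lock.h are
roughly the following, so the danger would be confined to code that
relies on the structure simply being zeroed rather than going through
osq_lock_init() or OSQ_LOCK_UNLOCKED:

#define OSQ_LOCK_UNLOCKED       { ATOMIC_INIT(OSQ_UNLOCKED_VAL) }

static inline void osq_lock_init(struct optimistic_spin_queue *lock)
{
        atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
}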
> and that would then avoid the +1 / -1 games in encoding/decoding the
> CPU numbers. It causes silly code generated like this:
>
> subl $1, %eax #, cpu_nr
> ...
> cltq
> addq __per_cpu_offset(,%rax,8), %rcx
>
> which seems honestly stupid. The cltq is there for sign-extension,
> which is because all these things are "int", and the "subl" will
> zero-extend to 64-bit, not sign-extend.
Changing all the variables to 'unsigned int' will remove the sign
extension and, after the decrement, gcc will know the high bits
are zero so shouldn't need to zero extend.
> At that point, I think gcc might be able to just generate
>
> addq __per_cpu_offset-8(,%rax,8), %rcx
That would need the decrement to be 64bit.
A quick test failed to make that work.
Probably (as you mentioned in the next email) because gcc
doesn't know that the high bits from atomic_xchg() are zero.
> but honestly, I think it would be nicer to just have decode_cpu() do
>
> unsigned int cpu_nr = encoded_cpu_val;
>
> return per_cpu_ptr(&osq_node, cpu_nr);
>
> and not have the -1/+1 at all.
Unless you can persuade gcc that the high bits from atomic_xchg()
are zero that will require a zero extend.
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply [flat|nested] 30+ messages in thread
[parent not found: <ZZB/jIvKgKQ2sV7M@gmail.com>]
* RE: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.
[not found] ` <ZZB/jIvKgKQ2sV7M@gmail.com>
@ 2023-12-30 22:47 ` David Laight
0 siblings, 0 replies; 30+ messages in thread
From: David Laight @ 2023-12-30 22:47 UTC (permalink / raw)
To: 'Ingo Molnar'
Cc: 'linux-kernel@vger.kernel.org',
'peterz@infradead.org', 'longman@redhat.com',
'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
From: Ingo Molnar
> Sent: 30 December 2023 20:38
>
> * David Laight <David.Laight@ACULAB.COM> wrote:
>
> > bool osq_lock(struct optimistic_spin_queue *lock)
> > {
> > - struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > + struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);
> > struct optimistic_spin_node *prev, *next;
> > int old;
> >
> > - if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> > - node->cpu = encode_cpu(smp_processor_id());
> > + if (unlikely(!node)) {
> > + int cpu = encode_cpu(smp_processor_id());
> > + node = decode_cpu(cpu);
> > + node->self = node;
> > + node->cpu = cpu;
>
> This whole initialization sequence is suboptimal and needs to be
> cleaned up first: the node->cpu field is constant once initialized, so
> it should be initialized from appropriate init methods, not at runtime in
> osq_lock(), right?
I thought that as well, but there would need to be a list of 'init'
functions for the per-cpu data. I didn't spot one.
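One hypothetical shape for such a hook (not something in this series, and
only safe if nothing takes an osq lock before it runs) would be a
boot-time initcall that walks the possible CPUs:

/* Hypothetical sketch: initialise osq_node.cpu once per CPU at boot. */
static int __init osq_node_init(void)
{
        int cpu;

        for_each_possible_cpu(cpu)
                per_cpu(osq_node, cpu).cpu = encode_cpu(cpu);

        return 0;
}
early_initcall(osq_node_init);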
David
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.
2023-12-29 20:51 [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code David Laight
` (3 preceding siblings ...)
2023-12-29 20:57 ` [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses David Laight
@ 2023-12-29 20:58 ` David Laight
2023-12-30 3:13 ` Waiman Long
2023-12-29 22:11 ` [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path David Laight
2023-12-30 19:40 ` [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code Linus Torvalds
6 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2023-12-29 20:58 UTC (permalink / raw)
To: 'linux-kernel@vger.kernel.org',
'peterz@infradead.org', 'longman@redhat.com'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
cpu is no longer running.
Although patched out for bare-metal the code still needs the cpu number.
Reading this from 'prev->cpu' is pretty much guaranteed to be a cache miss
when osq_unlock() is waking up the next cpu.
Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
This is simpler than checking for 'node->prev' changing and caching
'prev->cpu'.
Signed-off-by: David Laight <david.laight@aculab.com>
---
kernel/locking/osq_lock.c | 14 ++++++--------
1 file changed, 6 insertions(+), 8 deletions(-)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index b60b0add0161..89be63627434 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -14,8 +14,9 @@
struct optimistic_spin_node {
struct optimistic_spin_node *self, *next, *prev;
- int locked; /* 1 if lock acquired */
- int cpu; /* encoded CPU # + 1 value */
+ int locked; /* 1 if lock acquired */
+ int cpu; /* encoded CPU # + 1 value */
+ int prev_cpu; /* actual CPU # for vpcu_is_preempted() */
};
static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
@@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
}
-static inline int node_cpu(struct optimistic_spin_node *node)
-{
- return node->cpu - 1;
-}
-
static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
{
int cpu_nr = encoded_cpu_val - 1;
@@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
if (old == OSQ_UNLOCKED_VAL)
return true;
+ node->prev_cpu = old - 1;
prev = decode_cpu(old);
node->prev = prev;
node->locked = 0;
@@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* polling, be careful.
*/
if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
- vcpu_is_preempted(node_cpu(node->prev))))
+ vcpu_is_preempted(node->prev_cpu)))
return true;
/* unqueue */
@@ -205,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* it will wait in Step-A.
*/
+ WRITE_ONCE(next->prev_cpu, prev->cpu - 1);
WRITE_ONCE(next->prev, prev);
WRITE_ONCE(prev->next, next);
--
2.17.1
-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.
2023-12-29 20:58 ` [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check David Laight
@ 2023-12-30 3:13 ` Waiman Long
2023-12-30 15:57 ` Waiman Long
0 siblings, 1 reply; 30+ messages in thread
From: Waiman Long @ 2023-12-30 3:13 UTC (permalink / raw)
To: David Laight, 'linux-kernel@vger.kernel.org',
'peterz@infradead.org'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
On 12/29/23 15:58, David Laight wrote:
> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
> cpu is no longer running.
> Although patched out for bare-metal the code still needs the cpu number.
> Reading this from 'prev->cpu' is pretty much guaranteed to be a cache miss
> when osq_unlock() is waking up the next cpu.
>
> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
>
> This is simpler than checking for 'node->prev' changing and caching
> 'prev->cpu'.
>
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
> kernel/locking/osq_lock.c | 14 ++++++--------
> 1 file changed, 6 insertions(+), 8 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index b60b0add0161..89be63627434 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -14,8 +14,9 @@
>
> struct optimistic_spin_node {
> struct optimistic_spin_node *self, *next, *prev;
> - int locked; /* 1 if lock acquired */
> - int cpu; /* encoded CPU # + 1 value */
> + int locked; /* 1 if lock acquired */
> + int cpu; /* encoded CPU # + 1 value */
> + int prev_cpu; /* actual CPU # for vpcu_is_preempted() */
> };
>
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);
> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
> return cpu_nr + 1;
> }
>
> -static inline int node_cpu(struct optimistic_spin_node *node)
> -{
> - return node->cpu - 1;
> -}
> -
> static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
> {
> int cpu_nr = encoded_cpu_val - 1;
> @@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> if (old == OSQ_UNLOCKED_VAL)
> return true;
>
> + node->prev_cpu = old - 1;
> prev = decode_cpu(old);
> node->prev = prev;
> node->locked = 0;
> @@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * polling, be careful.
> */
> if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> - vcpu_is_preempted(node_cpu(node->prev))))
> + vcpu_is_preempted(node->prev_cpu)))
> return true;
>
> /* unqueue */
> @@ -205,6 +202,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * it will wait in Step-A.
> */
>
> + WRITE_ONCE(next->prev_cpu, prev->cpu - 1);
> WRITE_ONCE(next->prev, prev);
> WRITE_ONCE(prev->next, next);
Reviewed-by: Waiman Long <longman@redhat.com>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.
2023-12-30 3:13 ` Waiman Long
@ 2023-12-30 15:57 ` Waiman Long
2023-12-30 22:37 ` David Laight
0 siblings, 1 reply; 30+ messages in thread
From: Waiman Long @ 2023-12-30 15:57 UTC (permalink / raw)
To: David Laight, 'linux-kernel@vger.kernel.org',
'peterz@infradead.org'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
On 12/29/23 22:13, Waiman Long wrote:
>
> On 12/29/23 15:58, David Laight wrote:
>> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
>> cpu is no longer running.
>> Although patched out for bare-metal the code still needs the cpu number.
>> Reading this from 'prev->cpu' is pretty much guaranteed to cause a
>> cache miss
>> when osq_unlock() is waking up the next cpu.
>>
>> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
>> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
>>
>> This is simpler than checking for 'node->prev' changing and caching
>> 'prev->cpu'.
>>
>> Signed-off-by: David Laight <david.laight@aculab.com>
>> ---
>> kernel/locking/osq_lock.c | 14 ++++++--------
>> 1 file changed, 6 insertions(+), 8 deletions(-)
>>
>> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
>> index b60b0add0161..89be63627434 100644
>> --- a/kernel/locking/osq_lock.c
>> +++ b/kernel/locking/osq_lock.c
>> @@ -14,8 +14,9 @@
>> struct optimistic_spin_node {
>> struct optimistic_spin_node *self, *next, *prev;
>> - int locked; /* 1 if lock acquired */
>> - int cpu; /* encoded CPU # + 1 value */
>> + int locked; /* 1 if lock acquired */
>> + int cpu; /* encoded CPU # + 1 value */
>> + int prev_cpu; /* actual CPU # for vcpu_is_preempted() */
>> };
>> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node,
>> osq_node);
>> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
>> return cpu_nr + 1;
>> }
>> -static inline int node_cpu(struct optimistic_spin_node *node)
>> -{
>> - return node->cpu - 1;
>> -}
>> -
>> static inline struct optimistic_spin_node *decode_cpu(int
>> encoded_cpu_val)
>> {
>> int cpu_nr = encoded_cpu_val - 1;
>> @@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>> if (old == OSQ_UNLOCKED_VAL)
>> return true;
>> + node->prev_cpu = old - 1;
>> prev = decode_cpu(old);
>> node->prev = prev;
>> node->locked = 0;
>> @@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
>> * polling, be careful.
>> */
>> if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
>> - vcpu_is_preempted(node_cpu(node->prev))))
>> + vcpu_is_preempted(node->prev_cpu)))
On second thought, I believe it is more correct to use READ_ONCE() to
access "node->prev_cpu" as this field is subjected to change by a
WRITE_ONCE().
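Something along these lines, just a sketch of how that hunk might then read:

	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
				  vcpu_is_preempted(READ_ONCE(node->prev_cpu))))
		return true;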
Cheers,
Longman
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check.
2023-12-30 15:57 ` Waiman Long
@ 2023-12-30 22:37 ` David Laight
0 siblings, 0 replies; 30+ messages in thread
From: David Laight @ 2023-12-30 22:37 UTC (permalink / raw)
To: 'Waiman Long', 'linux-kernel@vger.kernel.org',
'peterz@infradead.org'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
From: Waiman Long
> Sent: 30 December 2023 15:57
>
> On 12/29/23 22:13, Waiman Long wrote:
> >
> > On 12/29/23 15:58, David Laight wrote:
> >> The vcpu_is_preempted() test stops osq_lock() spinning if a virtual
> >> cpu is no longer running.
> >> Although patched out for bare-metal the code still needs the cpu number.
> >> Reading this from 'prev->cpu' is pretty much guaranteed to cause a
> >> cache miss
> >> when osq_unlock() is waking up the next cpu.
> >>
> >> Instead save 'prev->cpu' in 'node->prev_cpu' and use that value instead.
> >> Update in the osq_lock() 'unqueue' path when 'node->prev' is changed.
> >>
> >> This is simpler than checking for 'node->prev' changing and caching
> >> 'prev->cpu'.
> >>
> >> Signed-off-by: David Laight <david.laight@aculab.com>
> >> ---
> >> kernel/locking/osq_lock.c | 14 ++++++--------
> >> 1 file changed, 6 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> >> index b60b0add0161..89be63627434 100644
> >> --- a/kernel/locking/osq_lock.c
> >> +++ b/kernel/locking/osq_lock.c
> >> @@ -14,8 +14,9 @@
> >> struct optimistic_spin_node {
> >> struct optimistic_spin_node *self, *next, *prev;
> >> - int locked; /* 1 if lock acquired */
> >> - int cpu; /* encoded CPU # + 1 value */
> >> + int locked; /* 1 if lock acquired */
> >> + int cpu; /* encoded CPU # + 1 value */
> >> + int prev_cpu; /* actual CPU # for vcpu_is_preempted() */
> >> };
> >> static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node,
> >> osq_node);
> >> @@ -29,11 +30,6 @@ static inline int encode_cpu(int cpu_nr)
> >> return cpu_nr + 1;
> >> }
> >> -static inline int node_cpu(struct optimistic_spin_node *node)
> >> -{
> >> - return node->cpu - 1;
> >> -}
> >> -
> >> static inline struct optimistic_spin_node *decode_cpu(int
> >> encoded_cpu_val)
> >> {
> >> int cpu_nr = encoded_cpu_val - 1;
> >> @@ -114,6 +110,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >> if (old == OSQ_UNLOCKED_VAL)
> >> return true;
> >> + node->prev_cpu = old - 1;
> >> prev = decode_cpu(old);
> >> node->prev = prev;
> >> node->locked = 0;
> >> @@ -148,7 +145,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> >> * polling, be careful.
> >> */
> >> if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
> >> - vcpu_is_preempted(node_cpu(node->prev))))
> >> + vcpu_is_preempted(node->prev_cpu)))
>
> On second thought, I believe it is more correct to use READ_ONCE() to
> access "node->prev_cpu" as this field is subjected to change by a
> WRITE_ONCE().
It can be done...
Aren't pretty much all the 'node' members accessed like that?
There is a sprinkling of READ_ONCE() and WRITE_ONCE(), but they
are not always used.
Maybe the structure member(s) should just be marked 'volatile' :-)
That should have exactly the same effect as the volatile cast
inside READ/WRITE_ONCE().
(I know there is a document about not using volatile...)
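(For scalar fields the casts inside those macros boil down to roughly the
following - a simplified sketch that ignores the type-checking wrappers and
the do/while around the write:)

	#define READ_ONCE(x)		(*(const volatile typeof(x) *)&(x))
	#define WRITE_ONCE(x, val)	(*(volatile typeof(x) *)&(x) = (val))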
I've not actually checked whether the two existing WRITE_ONCE()
in 'Step C' need to be ordered - and whether that is guaranteed
by the code, especially on our good old friend 'Alpha' (is that
horrid cache system still supported?).
David
^ permalink raw reply [flat|nested] 30+ messages in thread
* [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
2023-12-29 20:51 [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code David Laight
` (4 preceding siblings ...)
2023-12-29 20:58 ` [PATCH next 5/5] locking/osq_lock: Optimise vcpu_is_preempted() check David Laight
@ 2023-12-29 22:11 ` David Laight
2023-12-30 3:20 ` Waiman Long
2023-12-30 19:40 ` [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code Linus Torvalds
6 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2023-12-29 22:11 UTC (permalink / raw)
To: 'linux-kernel@vger.kernel.org',
'peterz@infradead.org', 'longman@redhat.com'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
osq_lock() starts by setting node->next to NULL and node->locked to 0.
Careful analysis shows that node->next is always NULL on entry.
node->locked is set non-zero by another cpu to force a wakeup.
This can only happen after the 'prev->next = node' assignment,
so locked can be set to zero just before that (along with the assignment
to node->prev).
Only initialise node->cpu once, after that use its value instead
of smp_processor_id() - which is probably a real function call.
Should reduce cache-line bouncing a little.
Signed-off-by: David Laight <david.laight@aculab.com>
---
Re-send without the 'RE:' on the subject line.
kernel/locking/osq_lock.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index d414eef4bec6..55f5db896c02 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
struct optimistic_spin_node *prev)
{
struct optimistic_spin_node *next = NULL;
- int curr = encode_cpu(smp_processor_id());
+ int curr = node->cpu;
int old;
/*
@@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
struct optimistic_spin_node *prev, *next;
- int curr = encode_cpu(smp_processor_id());
int old;
- node->locked = 0;
- node->next = NULL;
- node->cpu = curr;
+ if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
+ node->cpu = encode_cpu(smp_processor_id());
/*
* We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
* the node fields we just initialised) semantics when updating
* the lock tail.
*/
- old = atomic_xchg(&lock->tail, curr);
+ old = atomic_xchg(&lock->tail, node->cpu);
if (old == OSQ_UNLOCKED_VAL)
return true;
prev = decode_cpu(old);
node->prev = prev;
+ node->locked = 0;
/*
* osq_lock() unqueue
@@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
void osq_unlock(struct optimistic_spin_queue *lock)
{
struct optimistic_spin_node *node, *next;
- int curr = encode_cpu(smp_processor_id());
+ int curr = raw_cpu_read(osq_node.cpu);
/*
* Fast path for the uncontended case.
--
2.17.1
^ permalink raw reply related [flat|nested] 30+ messages in thread
* Re: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
2023-12-29 22:11 ` [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path David Laight
@ 2023-12-30 3:20 ` Waiman Long
2023-12-30 15:49 ` David Laight
0 siblings, 1 reply; 30+ messages in thread
From: Waiman Long @ 2023-12-30 3:20 UTC (permalink / raw)
To: David Laight, 'linux-kernel@vger.kernel.org',
'peterz@infradead.org'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
On 12/29/23 17:11, David Laight wrote:
> osq_lock() starts by setting node->next to NULL and node->locked to 0.
> Careful analysis shows that node->next is always NULL on entry.
>
> node->locked is set non-zero by another cpu to force a wakeup.
> This can only happen after the 'prev->next = node' assignment,
> so locked can be set to zero just before that (along with the assignment
> to node->prev).
>
> Only initialise node->cpu once, after that use its value instead
> of smp_processor_id() - which is probably a real function call.
>
> Should reduce cache-line bouncing a little.
>
> Signed-off-by: David Laight <david.laight@aculab.com>
> ---
>
> Re-send without the 'RE:' on the subject line.
> kernel/locking/osq_lock.c | 13 ++++++-------
> 1 file changed, 6 insertions(+), 7 deletions(-)
>
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d414eef4bec6..55f5db896c02 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> struct optimistic_spin_node *prev)
> {
> struct optimistic_spin_node *next = NULL;
> - int curr = encode_cpu(smp_processor_id());
> + int curr = node->cpu;
> int old;
>
> /*
> @@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> {
> struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> struct optimistic_spin_node *prev, *next;
> - int curr = encode_cpu(smp_processor_id());
> int old;
>
> - node->locked = 0;
> - node->next = NULL;
I have some concern about not clearing node->next at the beginning. I
know that next is supposed to be cleared at the end. I do have some
worry that there may exist a race condition that leaves next not cleared
before it is used again. So you either have to prove that such a race
does not exist, or keep initializing it to NULL at the beginning as is
done today.
Cheers,
Longman
> - node->cpu = curr;
> + if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> + node->cpu = encode_cpu(smp_processor_id());
>
> /*
> * We need both ACQUIRE (pairs with corresponding RELEASE in
> @@ -111,12 +109,13 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> * the node fields we just initialised) semantics when updating
> * the lock tail.
> */
> - old = atomic_xchg(&lock->tail, curr);
> + old = atomic_xchg(&lock->tail, node->cpu);
> if (old == OSQ_UNLOCKED_VAL)
> return true;
>
> prev = decode_cpu(old);
> node->prev = prev;
> + node->locked = 0;
>
> /*
> * osq_lock() unqueue
> @@ -214,7 +213,7 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> void osq_unlock(struct optimistic_spin_queue *lock)
> {
> struct optimistic_spin_node *node, *next;
> - int curr = encode_cpu(smp_processor_id());
> + int curr = raw_cpu_read(osq_node.cpu);
>
> /*
> * Fast path for the uncontended case.
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
2023-12-30 3:20 ` Waiman Long
@ 2023-12-30 15:49 ` David Laight
2024-01-02 18:53 ` Boqun Feng
0 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2023-12-30 15:49 UTC (permalink / raw)
To: 'Waiman Long', 'linux-kernel@vger.kernel.org',
'peterz@infradead.org'
Cc: 'mingo@redhat.com', 'will@kernel.org',
'boqun.feng@gmail.com', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
From: Waiman Long
> Sent: 30 December 2023 03:20
>
> On 12/29/23 17:11, David Laight wrote:
> > osq_lock() starts by setting node->next to NULL and node->locked to 0.
> > Careful analysis shows that node->next is always NULL on entry.
> >
> > node->locked is set non-zero by another cpu to force a wakeup.
> > This can only happen after the 'prev->next = node' assignment,
> > so locked can be set to zero just before that (along with the assignment
> > to node->prev).
> >
> > Only initialise node->cpu once, after that use its value instead
> > of smp_processor_id() - which is probably a real function call.
> >
> > Should reduce cache-line bouncing a little.
> >
> > Signed-off-by: David Laight <david.laight@aculab.com>
> > ---
> >
> > Re-send without the 'RE:' on the subject line.
> > kernel/locking/osq_lock.c | 13 ++++++-------
> > 1 file changed, 6 insertions(+), 7 deletions(-)
> >
> > diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> > index d414eef4bec6..55f5db896c02 100644
> > --- a/kernel/locking/osq_lock.c
> > +++ b/kernel/locking/osq_lock.c
> > @@ -51,7 +51,7 @@ osq_wait_next(struct optimistic_spin_queue *lock,
> > struct optimistic_spin_node *prev)
> > {
> > struct optimistic_spin_node *next = NULL;
> > - int curr = encode_cpu(smp_processor_id());
> > + int curr = node->cpu;
> > int old;
> >
> > /*
> > @@ -98,12 +98,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
> > {
> > struct optimistic_spin_node *node = this_cpu_ptr(&osq_node);
> > struct optimistic_spin_node *prev, *next;
> > - int curr = encode_cpu(smp_processor_id());
> > int old;
> >
> > - node->locked = 0;
> > - node->next = NULL;
>
> I have some concern about not clearing node->next at the beginning. I
> know that next is supposed to be cleared at the end. I do have some
> worry that there may exist a race condition that leaves next not cleared
> before it is used again. So you either have to prove that such a race
> does not exist, or initializing it to NULL at the beginning like it is
> today.
I've stared at the code some more :-)
There are two places where node->next is written non-NULL, both in osq_lock().
The first is at the top of the slow-path where prev->next = node
(this overwrites the NULL that was set (or, with this patch, not set) on entry).
The second is at the bottom of osq_lock(), prev->next = next (Step C),
where the NULL written in Step A is overwritten with the correct 'next' node.
After either of those the 'node' cpu must later either take the unqueue
exit from osq_lock() or call osq_unlock().
Both require it to wait for node->next to become non-NULL and then NULL it.
If code calls osq_lock() twice all bets are off!
Even if (somehow) node->next was non-NULL on entry it will be set by an
osq_lock() call from another cpu.
The only problem would be if osq_unlock() were called before the queueing
cpu set prev->next = node.
That in itself is unlikely - but would happen if node->next were always set.
I don't completely understand the 'acquire'/'release' semantics (they didn't
exist when I started doing SMP kernel code in the late 1980s).
But it looks odd that osq_unlock()'s fast path uses _release but the very
similar code in osq_wait_next() uses _acquire.
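(For reference, the two calls I mean are roughly these - quoted from memory,
so treat it as a sketch rather than the exact tree:)

	/* osq_unlock() fast path */
	if (likely(atomic_cmpxchg_release(&lock->tail, curr,
					  OSQ_UNLOCKED_VAL) == curr))
		return;

	/* osq_wait_next() */
	if (atomic_read(&lock->tail) == curr &&
	    atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr)
		break;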
Indeed, apart from some (assumed) optimisations, I think osq_unlock()
could just be:
	next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
	if (next)
		next->locked = 1;
I don't think the order of the tests for lock->tail and node->next
matters in osq_wait_next().
If they were swapped the 'Second most likely case' code from osq_unlock()
could be removed.
(The 'uncontended case' doesn't need to load the address of 'node'.)
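Something like this hypothetical reordering of the osq_wait_next() loop is
what I mean (untested, just a sketch):

	for (;;) {
		/* hypothetical: check node->next before lock->tail */
		if (node->next) {
			next = xchg(&node->next, NULL);
			if (next)
				break;
		}

		if (atomic_read(&lock->tail) == curr &&
		    atomic_cmpxchg_acquire(&lock->tail, curr, old) == curr)
			break;

		cpu_relax();
	}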
David
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
2023-12-30 15:49 ` David Laight
@ 2024-01-02 18:53 ` Boqun Feng
2024-01-02 23:32 ` David Laight
0 siblings, 1 reply; 30+ messages in thread
From: Boqun Feng @ 2024-01-02 18:53 UTC (permalink / raw)
To: David Laight
Cc: 'Waiman Long', 'linux-kernel@vger.kernel.org',
'peterz@infradead.org', 'mingo@redhat.com',
'will@kernel.org', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
On Sat, Dec 30, 2023 at 03:49:52PM +0000, David Laight wrote:
[...]
> I don't completely understand the 'acquire'/'release' semantics (they didn't
> exist when I started doing SMP kernel code in the late 1980s).
> But it looks odd that osq_unlock()'s fast path uses _release but the very
> similar code in osq_wait_next() uses _acquire.
>
The _release in osq_unlock() is needed since unlocks need to be
RELEASE so that lock+unlock forms a critical section (i.e. no memory
accesses can escape). When osq_wait_next() is used in non-unlock cases,
the RELEASE is not required. As for the case where osq_wait_next() is
used in osq_unlock(), there is an xchg() preceding it, which provides a
full barrier, so things are fine.
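(For reference, the part of osq_unlock() that reaches osq_wait_next()
currently looks roughly like this - a from-memory sketch, not the exact tree:)

	node = this_cpu_ptr(&osq_node);
	next = xchg(&node->next, NULL);		/* full barrier */
	if (next) {
		WRITE_ONCE(next->locked, 1);
		return;
	}

	next = osq_wait_next(lock, node, NULL);
	if (next)
		WRITE_ONCE(next->locked, 1);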
/me wonders whether we can relax the _acquire in osq_wait_next() into
a _relaxed.
> Indeed, apart from some (assumed) optimisations, I think osq_unlock()
> could just be:
> next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
> if (next)
> next->locked = 1;
>
If so, we need to provide some sort of RELEASE semantics for
osq_unlock() in all cases.
Regards,
Boqun
> I don't think the order of the tests for lock->tail and node->next
> matter in osq_wait_next().
> If they were swapped the 'Second most likely case' code from osq_unlock()
> could be removed.
> (The 'uncontended case' doesn't need to load the address of 'node'.)
>
> David
>
>
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path.
2024-01-02 18:53 ` Boqun Feng
@ 2024-01-02 23:32 ` David Laight
0 siblings, 0 replies; 30+ messages in thread
From: David Laight @ 2024-01-02 23:32 UTC (permalink / raw)
To: 'Boqun Feng'
Cc: 'Waiman Long', 'linux-kernel@vger.kernel.org',
'peterz@infradead.org', 'mingo@redhat.com',
'will@kernel.org', 'Linus Torvalds',
'xinhui.pan@linux.vnet.ibm.com',
'virtualization@lists.linux-foundation.org',
'Zeng Heng'
From: Boqun Feng
> Sent: 02 January 2024 18:54
>
> On Sat, Dec 30, 2023 at 03:49:52PM +0000, David Laight wrote:
> [...]
> > But it looks odd that osq_unlock()'s fast path uses _release but the very
> > similar code in osq_wait_next() uses _acquire.
> >
>
> The _release in osq_unlock() is needed since unlocks are needed to be
> RELEASE so that lock+unlock can be a critical section (i.e. no memory
> accesses can escape). When osq_wait_next() is used in non unlock cases,
> the RELEASE is not required. As for the case where osq_wait_next() is
> used in osq_unlock(), there is a xchg() preceding it, which provides a
> full barrier, so things are fine.
I know there have been issues with ACQUIRE/RELEASE/FULL xchg in this code,
but are FULL xchg()s always needed on node->next?
> /me wonders whether we can relax the _acquire in osq_wait_next() into
> a _relaxed.
I wouldn't have worried about relaxed v release.
> > Indeed, apart from some (assumed) optimisations, I think osq_unlock()
> > could just be:
> > next = osq_wait_next(lock, this_cpu_ptr(&osq_node), 0);
> > if (next)
> > next->locked = 1;
> >
>
> If so we need to provide some sort of RELEASE semantics for the
> osq_unlock() in all the cases.
I wonder how often the unqueue code happens, and especially for
the last cpu in the list?
I'd only expect need_resched() to return true after spinning for
a while - in which case perhaps it is more likely that there are
a lot of cpu in the queue and the cpu being removed won't be last.
So osq_wait_next() exits on xchg(&node->next, NULL) != NULL
which is a full barrier.
On a slightly different note I've also wondered if 'osq_node'
actually needs to be cache line aligned?
You definitely don't want it spanning 2 cache lines, but I'd
expect that per-cpu data is mostly accessed by its own cpu?
So you really aren't going to get false sharing with some
other per-cpu data since the cpu is busy in this code.
So __aligned(16) would do?
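(I.e. something like the following, purely hypothetical and untested - there
may well be a better-suited DEFINE_PER_CPU variant:)

	static DEFINE_PER_CPU(struct optimistic_spin_node, osq_node) __aligned(16);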
David
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code
2023-12-29 20:51 [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code David Laight
` (5 preceding siblings ...)
2023-12-29 22:11 ` [PATCH next 2/5] locking/osq_lock: Avoid dirtying the local cpu's 'node' in the osq_lock() fast path David Laight
@ 2023-12-30 19:40 ` Linus Torvalds
2023-12-30 22:39 ` David Laight
6 siblings, 1 reply; 30+ messages in thread
From: Linus Torvalds @ 2023-12-30 19:40 UTC (permalink / raw)
To: David Laight
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
longman@redhat.com, mingo@redhat.com, will@kernel.org,
boqun.feng@gmail.com, xinhui.pan@linux.vnet.ibm.com,
virtualization@lists.linux-foundation.org, Zeng Heng
On Fri, 29 Dec 2023 at 12:52, David Laight <David.Laight@aculab.com> wrote:
>
> David Laight (5):
> Move the definition of optimistic_spin_node into osf_lock.c
> Clarify osq_wait_next()
I took these two as preparatory independent patches, with that
osq_wait_next() clarification split into two.
I also did the renaming that Waiman asked for.
Linus
^ permalink raw reply [flat|nested] 30+ messages in thread
* RE: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code
2023-12-30 19:40 ` [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code Linus Torvalds
@ 2023-12-30 22:39 ` David Laight
2023-12-31 2:14 ` Waiman Long
0 siblings, 1 reply; 30+ messages in thread
From: David Laight @ 2023-12-30 22:39 UTC (permalink / raw)
To: 'Linus Torvalds'
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
longman@redhat.com, mingo@redhat.com, will@kernel.org,
boqun.feng@gmail.com, xinhui.pan@linux.vnet.ibm.com,
virtualization@lists.linux-foundation.org, Zeng Heng
From: Linus Torvalds
> Sent: 30 December 2023 19:41
>
> On Fri, 29 Dec 2023 at 12:52, David Laight <David.Laight@aculab.com> wrote:
> >
> > David Laight (5):
> > Move the definition of optimistic_spin_node into osf_lock.c
> > Clarify osq_wait_next()
>
> I took these two as preparatory independent patches, with that
> osq_wait_next() clarification split into two.
>
> I also did the renaming that Waiman asked for.
Ok, I'll try to remove them from any respin.
(Or at least put them first!)
David
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code
2023-12-30 22:39 ` David Laight
@ 2023-12-31 2:14 ` Waiman Long
0 siblings, 0 replies; 30+ messages in thread
From: Waiman Long @ 2023-12-31 2:14 UTC (permalink / raw)
To: David Laight, 'Linus Torvalds'
Cc: linux-kernel@vger.kernel.org, peterz@infradead.org,
mingo@redhat.com, will@kernel.org, boqun.feng@gmail.com,
xinhui.pan@linux.vnet.ibm.com,
virtualization@lists.linux-foundation.org, Zeng Heng
On 12/30/23 17:39, David Laight wrote:
> From: Linus Torvalds
>> Sent: 30 December 2023 19:41
>>
>> On Fri, 29 Dec 2023 at 12:52, David Laight <David.Laight@aculab.com> wrote:
>>> David Laight (5):
>>> Move the definition of optimistic_spin_node into osf_lock.c
>>> Clarify osq_wait_next()
>> I took these two as preparatory independent patches, with that
>> osq_wait_next() clarification split into two.
>>
>> I also did the renaming that Waiman asked for.
> Ok, I'll try to remove them from any respin.
> (Or at least put them first!)
The first 2 patches have been included in Linus' master branch. So you
should just remove them from a respin.
Cheers,
Longman
^ permalink raw reply [flat|nested] 30+ messages in thread