* [PATCH 01/13] locking/qspinlock: remove pv_node abstraction
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
@ 2022-07-04 14:38 ` Nicholas Piggin
2022-07-06 23:23 ` Boqun Feng
2022-07-04 14:38 ` [PATCH 02/13] locking/qspinlock: inline mcs_spinlock functions into qspinlock Nicholas Piggin
` (12 subsequent siblings)
13 siblings, 1 reply; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Waiman Long,
Boqun Feng, linux-kernel
There isn't much point trying to separate struct qnode from struct pv_node
when struct qnode has to know about pv_node anyway.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
kernel/locking/qspinlock.c | 3 ++-
kernel/locking/qspinlock_paravirt.h | 34 ++++++++++++-----------------
2 files changed, 16 insertions(+), 21 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 65a9a10caa6f..a0fc21d99199 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -82,7 +82,8 @@
struct qnode {
struct mcs_spinlock mcs;
#ifdef CONFIG_PARAVIRT_SPINLOCKS
- long reserved[2];
+ int cpu;
+ u8 state;
#endif
};
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index e84d21aa0722..b6a175155f36 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -47,12 +47,6 @@ enum vcpu_state {
vcpu_hashed, /* = pv_hash'ed + vcpu_halted */
};
-struct pv_node {
- struct mcs_spinlock mcs;
- int cpu;
- u8 state;
-};
-
/*
* Hybrid PV queued/unfair lock
*
@@ -170,7 +164,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
*/
struct pv_hash_entry {
struct qspinlock *lock;
- struct pv_node *node;
+ struct qnode *node;
};
#define PV_HE_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_entry))
@@ -209,7 +203,7 @@ void __init __pv_init_lock_hash(void)
offset < (1 << pv_lock_hash_bits); \
offset++, he = &pv_lock_hash[(hash + offset) & ((1 << pv_lock_hash_bits) - 1)])
-static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
+static struct qspinlock **pv_hash(struct qspinlock *lock, struct qnode *node)
{
unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
struct pv_hash_entry *he;
@@ -236,11 +230,11 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
BUG();
}
-static struct pv_node *pv_unhash(struct qspinlock *lock)
+static struct qnode *pv_unhash(struct qspinlock *lock)
{
unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
struct pv_hash_entry *he;
- struct pv_node *node;
+ struct qnode *node;
for_each_hash_entry(he, offset, hash) {
if (READ_ONCE(he->lock) == lock) {
@@ -264,7 +258,7 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
* in a running state.
*/
static inline bool
-pv_wait_early(struct pv_node *prev, int loop)
+pv_wait_early(struct qnode *prev, int loop)
{
if ((loop & PV_PREV_CHECK_MASK) != 0)
return false;
@@ -277,9 +271,9 @@ pv_wait_early(struct pv_node *prev, int loop)
*/
static void pv_init_node(struct mcs_spinlock *node)
{
- struct pv_node *pn = (struct pv_node *)node;
+ struct qnode *pn = (struct qnode *)node;
- BUILD_BUG_ON(sizeof(struct pv_node) > sizeof(struct qnode));
+ BUILD_BUG_ON(sizeof(struct qnode) > sizeof(struct qnode));
pn->cpu = smp_processor_id();
pn->state = vcpu_running;
@@ -292,8 +286,8 @@ static void pv_init_node(struct mcs_spinlock *node)
*/
static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
{
- struct pv_node *pn = (struct pv_node *)node;
- struct pv_node *pp = (struct pv_node *)prev;
+ struct qnode *pn = (struct qnode *)node;
+ struct qnode *pp = (struct qnode *)prev;
int loop;
bool wait_early;
@@ -359,7 +353,7 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
*/
static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
{
- struct pv_node *pn = (struct pv_node *)node;
+ struct qnode *pn = (struct qnode *)node;
/*
* If the vCPU is indeed halted, advance its state to match that of
@@ -402,7 +396,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
static u32
pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
{
- struct pv_node *pn = (struct pv_node *)node;
+ struct qnode *pn = (struct qnode *)node;
struct qspinlock **lp = NULL;
int waitcnt = 0;
int loop;
@@ -492,7 +486,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
__visible void
__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
{
- struct pv_node *node;
+ struct qnode *node;
if (unlikely(locked != _Q_SLOW_VAL)) {
WARN(!debug_locks_silent,
@@ -517,14 +511,14 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
node = pv_unhash(lock);
/*
- * Now that we have a reference to the (likely) blocked pv_node,
+ * Now that we have a reference to the (likely) blocked qnode,
* release the lock.
*/
smp_store_release(&lock->locked, 0);
/*
* At this point the memory pointed at by lock can be freed/reused,
- * however we can still use the pv_node to kick the CPU.
+ * however we can still use the qnode to kick the CPU.
* The other vCPU may not really be halted, but kicking an active
* vCPU is harmless other than the additional latency in completing
* the unlock.
--
2.35.1
* Re: [PATCH 01/13] locking/qspinlock: remove pv_node abstraction
2022-07-04 14:38 ` [PATCH 01/13] locking/qspinlock: remove pv_node abstraction Nicholas Piggin
@ 2022-07-06 23:23 ` Boqun Feng
0 siblings, 0 replies; 38+ messages in thread
From: Boqun Feng @ 2022-07-06 23:23 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
linux-kernel
On Tue, Jul 05, 2022 at 12:38:08AM +1000, Nicholas Piggin wrote:
> There isn't much point trying to separate struct qnode from struct pv_node
> when struct qnode has to know about pv_node anyway.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> kernel/locking/qspinlock.c | 3 ++-
> kernel/locking/qspinlock_paravirt.h | 34 ++++++++++++-----------------
> 2 files changed, 16 insertions(+), 21 deletions(-)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 65a9a10caa6f..a0fc21d99199 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -82,7 +82,8 @@
> struct qnode {
> struct mcs_spinlock mcs;
> #ifdef CONFIG_PARAVIRT_SPINLOCKS
> - long reserved[2];
> + int cpu;
> + u8 state;
> #endif
> };
>
> diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
> index e84d21aa0722..b6a175155f36 100644
> --- a/kernel/locking/qspinlock_paravirt.h
> +++ b/kernel/locking/qspinlock_paravirt.h
> @@ -47,12 +47,6 @@ enum vcpu_state {
> vcpu_hashed, /* = pv_hash'ed + vcpu_halted */
> };
>
> -struct pv_node {
> - struct mcs_spinlock mcs;
> - int cpu;
> - u8 state;
> -};
> -
> /*
> * Hybrid PV queued/unfair lock
> *
> @@ -170,7 +164,7 @@ static __always_inline int trylock_clear_pending(struct qspinlock *lock)
> */
> struct pv_hash_entry {
> struct qspinlock *lock;
> - struct pv_node *node;
> + struct qnode *node;
> };
>
> #define PV_HE_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_entry))
> @@ -209,7 +203,7 @@ void __init __pv_init_lock_hash(void)
> offset < (1 << pv_lock_hash_bits); \
> offset++, he = &pv_lock_hash[(hash + offset) & ((1 << pv_lock_hash_bits) - 1)])
>
> -static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
> +static struct qspinlock **pv_hash(struct qspinlock *lock, struct qnode *node)
> {
> unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
> struct pv_hash_entry *he;
> @@ -236,11 +230,11 @@ static struct qspinlock **pv_hash(struct qspinlock *lock, struct pv_node *node)
> BUG();
> }
>
> -static struct pv_node *pv_unhash(struct qspinlock *lock)
> +static struct qnode *pv_unhash(struct qspinlock *lock)
> {
> unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
> struct pv_hash_entry *he;
> - struct pv_node *node;
> + struct qnode *node;
>
> for_each_hash_entry(he, offset, hash) {
> if (READ_ONCE(he->lock) == lock) {
> @@ -264,7 +258,7 @@ static struct pv_node *pv_unhash(struct qspinlock *lock)
> * in a running state.
> */
> static inline bool
> -pv_wait_early(struct pv_node *prev, int loop)
> +pv_wait_early(struct qnode *prev, int loop)
> {
> if ((loop & PV_PREV_CHECK_MASK) != 0)
> return false;
> @@ -277,9 +271,9 @@ pv_wait_early(struct pv_node *prev, int loop)
> */
> static void pv_init_node(struct mcs_spinlock *node)
> {
> - struct pv_node *pn = (struct pv_node *)node;
> + struct qnode *pn = (struct qnode *)node;
>
> - BUILD_BUG_ON(sizeof(struct pv_node) > sizeof(struct qnode));
> + BUILD_BUG_ON(sizeof(struct qnode) > sizeof(struct qnode));
This line can actually be removed ;-)
Other part looks good to me.
Acked-by: Boqun Feng <boqun.feng@gmail.com>
Regards,
Boqun
>
> pn->cpu = smp_processor_id();
> pn->state = vcpu_running;
> @@ -292,8 +286,8 @@ static void pv_init_node(struct mcs_spinlock *node)
> */
> static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
> {
> - struct pv_node *pn = (struct pv_node *)node;
> - struct pv_node *pp = (struct pv_node *)prev;
> + struct qnode *pn = (struct qnode *)node;
> + struct qnode *pp = (struct qnode *)prev;
> int loop;
> bool wait_early;
>
> @@ -359,7 +353,7 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
> */
> static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> {
> - struct pv_node *pn = (struct pv_node *)node;
> + struct qnode *pn = (struct qnode *)node;
>
> /*
> * If the vCPU is indeed halted, advance its state to match that of
> @@ -402,7 +396,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
> static u32
> pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
> {
> - struct pv_node *pn = (struct pv_node *)node;
> + struct qnode *pn = (struct qnode *)node;
> struct qspinlock **lp = NULL;
> int waitcnt = 0;
> int loop;
> @@ -492,7 +486,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
> __visible void
> __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
> {
> - struct pv_node *node;
> + struct qnode *node;
>
> if (unlikely(locked != _Q_SLOW_VAL)) {
> WARN(!debug_locks_silent,
> @@ -517,14 +511,14 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
> node = pv_unhash(lock);
>
> /*
> - * Now that we have a reference to the (likely) blocked pv_node,
> + * Now that we have a reference to the (likely) blocked qnode,
> * release the lock.
> */
> smp_store_release(&lock->locked, 0);
>
> /*
> * At this point the memory pointed at by lock can be freed/reused,
> - * however we can still use the pv_node to kick the CPU.
> + * however we can still use the qnode to kick the CPU.
> * The other vCPU may not really be halted, but kicking an active
> * vCPU is harmless other than the additional latency in completing
> * the unlock.
> --
> 2.35.1
>
* [PATCH 02/13] locking/qspinlock: inline mcs_spinlock functions into qspinlock
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
2022-07-04 14:38 ` [PATCH 01/13] locking/qspinlock: remove pv_node abstraction Nicholas Piggin
@ 2022-07-04 14:38 ` Nicholas Piggin
2022-07-05 16:57 ` Peter Zijlstra
2022-07-04 14:38 ` [PATCH 03/13] locking/qspinlock: split common mcs queueing code into its own function Nicholas Piggin
` (11 subsequent siblings)
13 siblings, 1 reply; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Waiman Long,
Boqun Feng, linux-kernel
qspinlock uses mcs_spinlock for the struct type (.next, .locked, and the
misplaced .count), and arch_mcs_spin_{un}lock_contended(). These can be
trivially inlined into qspinlock, and the unused mcs_spinlock code
removed.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/alpha/include/asm/Kbuild | 1 -
arch/arc/include/asm/Kbuild | 1 -
arch/arm/include/asm/mcs_spinlock.h | 24 ------
arch/arm64/include/asm/Kbuild | 1 -
arch/hexagon/include/asm/Kbuild | 1 -
arch/ia64/include/asm/Kbuild | 1 -
arch/m68k/include/asm/Kbuild | 1 -
arch/microblaze/include/asm/Kbuild | 1 -
arch/mips/include/asm/Kbuild | 1 -
arch/nios2/include/asm/Kbuild | 1 -
arch/parisc/include/asm/Kbuild | 1 -
arch/powerpc/include/asm/Kbuild | 1 -
arch/s390/include/asm/Kbuild | 1 -
arch/sh/include/asm/Kbuild | 1 -
arch/sparc/include/asm/Kbuild | 1 -
arch/um/include/asm/Kbuild | 1 -
arch/x86/include/asm/Kbuild | 1 -
arch/xtensa/include/asm/Kbuild | 1 -
include/asm-generic/mcs_spinlock.h | 13 ---
kernel/locking/mcs_spinlock.h | 121 ----------------------------
kernel/locking/qspinlock.c | 38 ++++-----
kernel/locking/qspinlock_paravirt.h | 55 ++++++-------
22 files changed, 43 insertions(+), 225 deletions(-)
delete mode 100644 arch/arm/include/asm/mcs_spinlock.h
delete mode 100644 include/asm-generic/mcs_spinlock.h
delete mode 100644 kernel/locking/mcs_spinlock.h
diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 42911c8340c7..d21cf7b3173a 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -3,4 +3,3 @@
generated-y += syscall_table.h
generic-y += export.h
generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index 3c1afa524b9c..5ae4337a9301 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -1,6 +1,5 @@
# SPDX-License-Identifier: GPL-2.0
generic-y += extable.h
generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += user.h
diff --git a/arch/arm/include/asm/mcs_spinlock.h b/arch/arm/include/asm/mcs_spinlock.h
deleted file mode 100644
index 529d2cf4d06f..000000000000
--- a/arch/arm/include/asm/mcs_spinlock.h
+++ /dev/null
@@ -1,24 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __ASM_MCS_LOCK_H
-#define __ASM_MCS_LOCK_H
-
-#ifdef CONFIG_SMP
-#include <asm/spinlock.h>
-
-/* MCS spin-locking. */
-#define arch_mcs_spin_lock_contended(lock) \
-do { \
- /* Ensure prior stores are observed before we enter wfe. */ \
- smp_mb(); \
- while (!(smp_load_acquire(lock))) \
- wfe(); \
-} while (0) \
-
-#define arch_mcs_spin_unlock_contended(lock) \
-do { \
- smp_store_release(lock, 1); \
- dsb_sev(); \
-} while (0)
-
-#endif /* CONFIG_SMP */
-#endif /* __ASM_MCS_LOCK_H */
diff --git a/arch/arm64/include/asm/Kbuild b/arch/arm64/include/asm/Kbuild
index 5c8ee5a541d2..57e9ad366d25 100644
--- a/arch/arm64/include/asm/Kbuild
+++ b/arch/arm64/include/asm/Kbuild
@@ -1,6 +1,5 @@
# SPDX-License-Identifier: GPL-2.0
generic-y += early_ioremap.h
-generic-y += mcs_spinlock.h
generic-y += qrwlock.h
generic-y += qspinlock.h
generic-y += parport.h
diff --git a/arch/hexagon/include/asm/Kbuild b/arch/hexagon/include/asm/Kbuild
index 3ece3c93fe08..37bbf99f66d4 100644
--- a/arch/hexagon/include/asm/Kbuild
+++ b/arch/hexagon/include/asm/Kbuild
@@ -2,4 +2,3 @@
generic-y += extable.h
generic-y += iomap.h
generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
diff --git a/arch/ia64/include/asm/Kbuild b/arch/ia64/include/asm/Kbuild
index f994c1daf9d4..a0198c12e339 100644
--- a/arch/ia64/include/asm/Kbuild
+++ b/arch/ia64/include/asm/Kbuild
@@ -1,5 +1,4 @@
# SPDX-License-Identifier: GPL-2.0
generated-y += syscall_table.h
generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
generic-y += vtime.h
diff --git a/arch/m68k/include/asm/Kbuild b/arch/m68k/include/asm/Kbuild
index 1b720299deb1..8dbef73ce01d 100644
--- a/arch/m68k/include/asm/Kbuild
+++ b/arch/m68k/include/asm/Kbuild
@@ -3,5 +3,4 @@ generated-y += syscall_table.h
generic-y += export.h
generic-y += extable.h
generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
generic-y += spinlock.h
diff --git a/arch/microblaze/include/asm/Kbuild b/arch/microblaze/include/asm/Kbuild
index a055f5dbe00a..7615a27e0851 100644
--- a/arch/microblaze/include/asm/Kbuild
+++ b/arch/microblaze/include/asm/Kbuild
@@ -3,7 +3,6 @@ generated-y += syscall_table.h
generic-y += cmpxchg.h
generic-y += extable.h
generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += syscalls.h
generic-y += tlb.h
diff --git a/arch/mips/include/asm/Kbuild b/arch/mips/include/asm/Kbuild
index dee172716581..65cedca08771 100644
--- a/arch/mips/include/asm/Kbuild
+++ b/arch/mips/include/asm/Kbuild
@@ -9,7 +9,6 @@ generated-y += unistd_nr_o32.h
generic-y += export.h
generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
generic-y += parport.h
generic-y += qrwlock.h
generic-y += qspinlock.h
diff --git a/arch/nios2/include/asm/Kbuild b/arch/nios2/include/asm/Kbuild
index 7fe7437555fb..5718eee9665c 100644
--- a/arch/nios2/include/asm/Kbuild
+++ b/arch/nios2/include/asm/Kbuild
@@ -2,6 +2,5 @@
generic-y += cmpxchg.h
generic-y += extable.h
generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
generic-y += spinlock.h
generic-y += user.h
diff --git a/arch/parisc/include/asm/Kbuild b/arch/parisc/include/asm/Kbuild
index e6e7f74c8ac9..1f0c28d74c88 100644
--- a/arch/parisc/include/asm/Kbuild
+++ b/arch/parisc/include/asm/Kbuild
@@ -2,5 +2,4 @@
generated-y += syscall_table_32.h
generated-y += syscall_table_64.h
generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
generic-y += user.h
diff --git a/arch/powerpc/include/asm/Kbuild b/arch/powerpc/include/asm/Kbuild
index bcf95ce0964f..813a8c3405ad 100644
--- a/arch/powerpc/include/asm/Kbuild
+++ b/arch/powerpc/include/asm/Kbuild
@@ -4,7 +4,6 @@ generated-y += syscall_table_64.h
generated-y += syscall_table_spu.h
generic-y += export.h
generic-y += kvm_types.h
-generic-y += mcs_spinlock.h
generic-y += qrwlock.h
generic-y += vtime.h
generic-y += early_ioremap.h
diff --git a/arch/s390/include/asm/Kbuild b/arch/s390/include/asm/Kbuild
index 1a18d7b82f86..8b036a4ee2ca 100644
--- a/arch/s390/include/asm/Kbuild
+++ b/arch/s390/include/asm/Kbuild
@@ -7,4 +7,3 @@ generated-y += unistd_nr.h
generic-y += asm-offsets.h
generic-y += export.h
generic-y += kvm_types.h
-generic-y += mcs_spinlock.h
diff --git a/arch/sh/include/asm/Kbuild b/arch/sh/include/asm/Kbuild
index fc44d9c88b41..3192f19bcf85 100644
--- a/arch/sh/include/asm/Kbuild
+++ b/arch/sh/include/asm/Kbuild
@@ -1,5 +1,4 @@
# SPDX-License-Identifier: GPL-2.0
generated-y += syscall_table.h
generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
generic-y += parport.h
diff --git a/arch/sparc/include/asm/Kbuild b/arch/sparc/include/asm/Kbuild
index 0b9d98ced34a..f0b913f7ba05 100644
--- a/arch/sparc/include/asm/Kbuild
+++ b/arch/sparc/include/asm/Kbuild
@@ -3,4 +3,3 @@ generated-y += syscall_table_32.h
generated-y += syscall_table_64.h
generic-y += export.h
generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
diff --git a/arch/um/include/asm/Kbuild b/arch/um/include/asm/Kbuild
index b2d834a29f3a..04080c0c1aec 100644
--- a/arch/um/include/asm/Kbuild
+++ b/arch/um/include/asm/Kbuild
@@ -14,7 +14,6 @@ generic-y += hw_irq.h
generic-y += irq_regs.h
generic-y += irq_work.h
generic-y += kdebug.h
-generic-y += mcs_spinlock.h
generic-y += mmiowb.h
generic-y += module.lds.h
generic-y += param.h
diff --git a/arch/x86/include/asm/Kbuild b/arch/x86/include/asm/Kbuild
index 1e51650b79d7..beb7683f7b8f 100644
--- a/arch/x86/include/asm/Kbuild
+++ b/arch/x86/include/asm/Kbuild
@@ -10,4 +10,3 @@ generated-y += xen-hypercalls.h
generic-y += early_ioremap.h
generic-y += export.h
-generic-y += mcs_spinlock.h
diff --git a/arch/xtensa/include/asm/Kbuild b/arch/xtensa/include/asm/Kbuild
index fa07c686cbcc..29ae65cb68c2 100644
--- a/arch/xtensa/include/asm/Kbuild
+++ b/arch/xtensa/include/asm/Kbuild
@@ -2,7 +2,6 @@
generated-y += syscall_table.h
generic-y += extable.h
generic-y += kvm_para.h
-generic-y += mcs_spinlock.h
generic-y += param.h
generic-y += parport.h
generic-y += qrwlock.h
diff --git a/include/asm-generic/mcs_spinlock.h b/include/asm-generic/mcs_spinlock.h
deleted file mode 100644
index 10cd4ffc6ba2..000000000000
--- a/include/asm-generic/mcs_spinlock.h
+++ /dev/null
@@ -1,13 +0,0 @@
-#ifndef __ASM_MCS_SPINLOCK_H
-#define __ASM_MCS_SPINLOCK_H
-
-/*
- * Architectures can define their own:
- *
- * arch_mcs_spin_lock_contended(l)
- * arch_mcs_spin_unlock_contended(l)
- *
- * See kernel/locking/mcs_spinlock.c.
- */
-
-#endif /* __ASM_MCS_SPINLOCK_H */
diff --git a/kernel/locking/mcs_spinlock.h b/kernel/locking/mcs_spinlock.h
deleted file mode 100644
index 85251d8771d9..000000000000
--- a/kernel/locking/mcs_spinlock.h
+++ /dev/null
@@ -1,121 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * MCS lock defines
- *
- * This file contains the main data structure and API definitions of MCS lock.
- *
- * The MCS lock (proposed by Mellor-Crummey and Scott) is a simple spin-lock
- * with the desirable properties of being fair, and with each cpu trying
- * to acquire the lock spinning on a local variable.
- * It avoids expensive cache bounces that common test-and-set spin-lock
- * implementations incur.
- */
-#ifndef __LINUX_MCS_SPINLOCK_H
-#define __LINUX_MCS_SPINLOCK_H
-
-#include <asm/mcs_spinlock.h>
-
-struct mcs_spinlock {
- struct mcs_spinlock *next;
- int locked; /* 1 if lock acquired */
- int count; /* nesting count, see qspinlock.c */
-};
-
-#ifndef arch_mcs_spin_lock_contended
-/*
- * Using smp_cond_load_acquire() provides the acquire semantics
- * required so that subsequent operations happen after the
- * lock is acquired. Additionally, some architectures such as
- * ARM64 would like to do spin-waiting instead of purely
- * spinning, and smp_cond_load_acquire() provides that behavior.
- */
-#define arch_mcs_spin_lock_contended(l) \
-do { \
- smp_cond_load_acquire(l, VAL); \
-} while (0)
-#endif
-
-#ifndef arch_mcs_spin_unlock_contended
-/*
- * smp_store_release() provides a memory barrier to ensure all
- * operations in the critical section has been completed before
- * unlocking.
- */
-#define arch_mcs_spin_unlock_contended(l) \
- smp_store_release((l), 1)
-#endif
-
-/*
- * Note: the smp_load_acquire/smp_store_release pair is not
- * sufficient to form a full memory barrier across
- * cpus for many architectures (except x86) for mcs_unlock and mcs_lock.
- * For applications that need a full barrier across multiple cpus
- * with mcs_unlock and mcs_lock pair, smp_mb__after_unlock_lock() should be
- * used after mcs_lock.
- */
-
-/*
- * In order to acquire the lock, the caller should declare a local node and
- * pass a reference of the node to this function in addition to the lock.
- * If the lock has already been acquired, then this will proceed to spin
- * on this node->locked until the previous lock holder sets the node->locked
- * in mcs_spin_unlock().
- */
-static inline
-void mcs_spin_lock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
-{
- struct mcs_spinlock *prev;
-
- /* Init node */
- node->locked = 0;
- node->next = NULL;
-
- /*
- * We rely on the full barrier with global transitivity implied by the
- * below xchg() to order the initialization stores above against any
- * observation of @node. And to provide the ACQUIRE ordering associated
- * with a LOCK primitive.
- */
- prev = xchg(lock, node);
- if (likely(prev == NULL)) {
- /*
- * Lock acquired, don't need to set node->locked to 1. Threads
- * only spin on its own node->locked value for lock acquisition.
- * However, since this thread can immediately acquire the lock
- * and does not proceed to spin on its own node->locked, this
- * value won't be used. If a debug mode is needed to
- * audit lock status, then set node->locked value here.
- */
- return;
- }
- WRITE_ONCE(prev->next, node);
-
- /* Wait until the lock holder passes the lock down. */
- arch_mcs_spin_lock_contended(&node->locked);
-}
-
-/*
- * Releases the lock. The caller should pass in the corresponding node that
- * was used to acquire the lock.
- */
-static inline
-void mcs_spin_unlock(struct mcs_spinlock **lock, struct mcs_spinlock *node)
-{
- struct mcs_spinlock *next = READ_ONCE(node->next);
-
- if (likely(!next)) {
- /*
- * Release the lock by setting it to NULL
- */
- if (likely(cmpxchg_release(lock, node, NULL) == node))
- return;
- /* Wait until the next pointer is set */
- while (!(next = READ_ONCE(node->next)))
- cpu_relax();
- }
-
- /* Pass lock to next waiter. */
- arch_mcs_spin_unlock_contended(&next->locked);
-}
-
-#endif /* __LINUX_MCS_SPINLOCK_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index a0fc21d99199..32f401e966ab 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -66,11 +66,10 @@
*
*/
-#include "mcs_spinlock.h"
#define MAX_NODES 4
/*
- * On 64-bit architectures, the mcs_spinlock structure will be 16 bytes in
+ * On 64-bit architectures, the qnode structure will be 16 bytes in
* size and four of them will fit nicely in one 64-byte cacheline. For
* pvqspinlock, however, we need more space for extra data. To accommodate
* that, we insert two more long words to pad it up to 32 bytes. IOW, only
@@ -80,7 +79,9 @@
* qspinlocks.
*/
struct qnode {
- struct mcs_spinlock mcs;
+ struct qnode *next;
+ int locked; /* 1 if lock acquired */
+ int count; /* nesting count */
#ifdef CONFIG_PARAVIRT_SPINLOCKS
int cpu;
u8 state;
@@ -124,18 +125,18 @@ static inline __pure u32 encode_tail(int cpu, int idx)
return tail;
}
-static inline __pure struct mcs_spinlock *decode_tail(u32 tail)
+static inline __pure struct qnode *decode_tail(u32 tail)
{
int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;
- return per_cpu_ptr(&qnodes[idx].mcs, cpu);
+ return per_cpu_ptr(&qnodes[idx], cpu);
}
static inline __pure
-struct mcs_spinlock *grab_mcs_node(struct mcs_spinlock *base, int idx)
+struct qnode *grab_qnode(struct qnode *base, int idx)
{
- return &((struct qnode *)base + idx)->mcs;
+ return &base[idx];
}
#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
@@ -271,13 +272,13 @@ static __always_inline void set_locked(struct qspinlock *lock)
* all the PV callbacks.
*/
-static __always_inline void __pv_init_node(struct mcs_spinlock *node) { }
-static __always_inline void __pv_wait_node(struct mcs_spinlock *node,
- struct mcs_spinlock *prev) { }
+static __always_inline void __pv_init_node(struct qnode *node) { }
+static __always_inline void __pv_wait_node(struct qnode *node,
+ struct qnode *prev) { }
static __always_inline void __pv_kick_node(struct qspinlock *lock,
- struct mcs_spinlock *node) { }
+ struct qnode *node) { }
static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
- struct mcs_spinlock *node)
+ struct qnode *node)
{ return 0; }
#define pv_enabled() false
@@ -316,7 +317,7 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
*/
void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
- struct mcs_spinlock *prev, *next, *node;
+ struct qnode *prev, *next, *node;
u32 old, tail;
int idx;
@@ -399,7 +400,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
queue:
lockevent_inc(lock_slowpath);
pv_queue:
- node = this_cpu_ptr(&qnodes[0].mcs);
+ node = this_cpu_ptr(&qnodes[0]);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
@@ -421,7 +422,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
goto release;
}
- node = grab_mcs_node(node, idx);
+ node = grab_qnode(node, idx);
/*
* Keep counts of non-zero index values:
@@ -475,7 +476,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
WRITE_ONCE(prev->next, node);
pv_wait_node(node, prev);
- arch_mcs_spin_lock_contended(&node->locked);
+ /* Wait for mcs node lock to be released */
+ smp_cond_load_acquire(&node->locked, VAL);
/*
* While waiting for the MCS lock, the next pointer may have
@@ -554,7 +556,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
if (!next)
next = smp_cond_load_relaxed(&node->next, (VAL));
- arch_mcs_spin_unlock_contended(&next->locked);
+ smp_store_release(&next->locked, 1); /* unlock the mcs node lock */
pv_kick_node(lock, next);
release:
@@ -563,7 +565,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
/*
* release the node
*/
- __this_cpu_dec(qnodes[0].mcs.count);
+ __this_cpu_dec(qnodes[0].count);
}
EXPORT_SYMBOL(queued_spin_lock_slowpath);
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index b6a175155f36..cce3d3dde216 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -267,16 +267,12 @@ pv_wait_early(struct qnode *prev, int loop)
}
/*
- * Initialize the PV part of the mcs_spinlock node.
+ * Initialize the PV part of the qnode.
*/
-static void pv_init_node(struct mcs_spinlock *node)
+static void pv_init_node(struct qnode *node)
{
- struct qnode *pn = (struct qnode *)node;
-
- BUILD_BUG_ON(sizeof(struct qnode) > sizeof(struct qnode));
-
- pn->cpu = smp_processor_id();
- pn->state = vcpu_running;
+ node->cpu = smp_processor_id();
+ node->state = vcpu_running;
}
/*
@@ -284,10 +280,8 @@ static void pv_init_node(struct mcs_spinlock *node)
* pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
* behalf.
*/
-static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
+static void pv_wait_node(struct qnode *node, struct qnode *prev)
{
- struct qnode *pn = (struct qnode *)node;
- struct qnode *pp = (struct qnode *)prev;
int loop;
bool wait_early;
@@ -295,7 +289,7 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
if (READ_ONCE(node->locked))
return;
- if (pv_wait_early(pp, loop)) {
+ if (pv_wait_early(prev, loop)) {
wait_early = true;
break;
}
@@ -303,20 +297,20 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
}
/*
- * Order pn->state vs pn->locked thusly:
+ * Order node->state vs node->locked thusly:
*
- * [S] pn->state = vcpu_halted [S] next->locked = 1
- * MB MB
- * [L] pn->locked [RmW] pn->state = vcpu_hashed
+ * [S] node->state = vcpu_halted [S] next->locked = 1
+ * MB MB
+ * [L] node->locked [RmW] node->state = vcpu_hashed
*
* Matches the cmpxchg() from pv_kick_node().
*/
- smp_store_mb(pn->state, vcpu_halted);
+ smp_store_mb(node->state, vcpu_halted);
if (!READ_ONCE(node->locked)) {
lockevent_inc(pv_wait_node);
lockevent_cond_inc(pv_wait_early, wait_early);
- pv_wait(&pn->state, vcpu_halted);
+ pv_wait(&node->state, vcpu_halted);
}
/*
@@ -324,7 +318,7 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
* value so that pv_wait_head_or_lock() knows to not also try
* to hash this lock.
*/
- cmpxchg(&pn->state, vcpu_halted, vcpu_running);
+ cmpxchg(&node->state, vcpu_halted, vcpu_running);
/*
* If the locked flag is still not set after wakeup, it is a
@@ -351,10 +345,8 @@ static void pv_wait_node(struct mcs_spinlock *node, struct mcs_spinlock *prev)
* such that they're waiting in pv_wait_head_or_lock(), this avoids a
* wake/sleep cycle.
*/
-static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
+static void pv_kick_node(struct qspinlock *lock, struct qnode *node)
{
- struct qnode *pn = (struct qnode *)node;
-
/*
* If the vCPU is indeed halted, advance its state to match that of
* pv_wait_node(). If OTOH this fails, the vCPU was running and will
@@ -363,15 +355,15 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
* Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
*
* The write to next->locked in arch_mcs_spin_unlock_contended()
- * must be ordered before the read of pn->state in the cmpxchg()
+ * must be ordered before the read of node->state in the cmpxchg()
* below for the code to work correctly. To guarantee full ordering
* irrespective of the success or failure of the cmpxchg(),
* a relaxed version with explicit barrier is used. The control
- * dependency will order the reading of pn->state before any
+ * dependency will order the reading of node->state before any
* subsequent writes.
*/
smp_mb__before_atomic();
- if (cmpxchg_relaxed(&pn->state, vcpu_halted, vcpu_hashed)
+ if (cmpxchg_relaxed(&node->state, vcpu_halted, vcpu_hashed)
!= vcpu_halted)
return;
@@ -383,7 +375,7 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
* needed.
*/
WRITE_ONCE(lock->locked, _Q_SLOW_VAL);
- (void)pv_hash(lock, pn);
+ (void)pv_hash(lock, node);
}
/*
@@ -394,9 +386,8 @@ static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
* The current value of the lock will be returned for additional processing.
*/
static u32
-pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
+pv_wait_head_or_lock(struct qspinlock *lock, struct qnode *node)
{
- struct qnode *pn = (struct qnode *)node;
struct qspinlock **lp = NULL;
int waitcnt = 0;
int loop;
@@ -405,7 +396,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
* If pv_kick_node() already advanced our state, we don't need to
* insert ourselves into the hash table anymore.
*/
- if (READ_ONCE(pn->state) == vcpu_hashed)
+ if (READ_ONCE(node->state) == vcpu_hashed)
lp = (struct qspinlock **)1;
/*
@@ -418,7 +409,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
* Set correct vCPU state to be used by queue node wait-early
* mechanism.
*/
- WRITE_ONCE(pn->state, vcpu_running);
+ WRITE_ONCE(node->state, vcpu_running);
/*
* Set the pending bit in the active lock spinning loop to
@@ -434,7 +425,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
if (!lp) { /* ONCE */
- lp = pv_hash(lock, pn);
+ lp = pv_hash(lock, node);
/*
* We must hash before setting _Q_SLOW_VAL, such that
@@ -458,7 +449,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct mcs_spinlock *node)
goto gotlock;
}
}
- WRITE_ONCE(pn->state, vcpu_hashed);
+ WRITE_ONCE(node->state, vcpu_hashed);
lockevent_inc(pv_wait_head);
lockevent_cond_inc(pv_wait_again, waitcnt);
pv_wait(&lock->locked, _Q_SLOW_VAL);
--
2.35.1
* Re: [PATCH 02/13] locking/qspinlock: inline mcs_spinlock functions into qspinlock
2022-07-04 14:38 ` [PATCH 02/13] locking/qspinlock: inline mcs_spinlock functions into qspinlock Nicholas Piggin
@ 2022-07-05 16:57 ` Peter Zijlstra
2022-07-12 0:06 ` Nicholas Piggin
0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2022-07-05 16:57 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, linux-kernel
On Tue, Jul 05, 2022 at 12:38:09AM +1000, Nicholas Piggin wrote:
> qspinlock uses mcs_spinlock for the struct type (.next, .locked, and the
> misplaced .count), and arch_mcs_spin_{un}lock_contended(). These can be
> trivially inlined into qspinlock, and the unused mcs_spinlock code
> removed.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> arch/arm/include/asm/mcs_spinlock.h | 24 ------
> --- a/arch/arm/include/asm/mcs_spinlock.h
> +++ /dev/null
> @@ -1,24 +0,0 @@
> -/* SPDX-License-Identifier: GPL-2.0 */
> -#ifndef __ASM_MCS_LOCK_H
> -#define __ASM_MCS_LOCK_H
> -
> -#ifdef CONFIG_SMP
> -#include <asm/spinlock.h>
> -
> -/* MCS spin-locking. */
> -#define arch_mcs_spin_lock_contended(lock) \
> -do { \
> - /* Ensure prior stores are observed before we enter wfe. */ \
> - smp_mb(); \
> - while (!(smp_load_acquire(lock))) \
> - wfe(); \
> -} while (0) \
> -
> -#define arch_mcs_spin_unlock_contended(lock) \
> -do { \
> - smp_store_release(lock, 1); \
> - dsb_sev(); \
> -} while (0)
> -
> -#endif /* CONFIG_SMP */
> -#endif /* __ASM_MCS_LOCK_H */
> @@ -475,7 +476,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> WRITE_ONCE(prev->next, node);
>
> pv_wait_node(node, prev);
> - arch_mcs_spin_lock_contended(&node->locked);
> + /* Wait for mcs node lock to be released */
> + smp_cond_load_acquire(&node->locked, VAL);
>
> /*
> * While waiting for the MCS lock, the next pointer may have
> @@ -554,7 +556,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> if (!next)
> next = smp_cond_load_relaxed(&node->next, (VAL));
>
> - arch_mcs_spin_unlock_contended(&next->locked);
> + smp_store_release(&next->locked, 1); /* unlock the mcs node lock */
These are not equivalent. Now it so happens that ARM doesn't use
qspinlock and the other mcs lock users are gone by now, but something
like this should at least have been called out in the Changelog I think.
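To spell out the difference: a minimal side-by-side of the two wait/release
pairs, written as standalone helpers purely for illustration (the helper
names are invented here; the primitives are the ones quoted above, and the
arm32 pair of course only builds on that architecture):

static inline void mcs_node_wait_arm32(int *locked)
{
	/* Ensure prior stores are observed before we enter wfe. */
	smp_mb();
	while (!smp_load_acquire(locked))
		wfe();			/* sleep until an event arrives */
}

static inline void mcs_node_unlock_arm32(int *locked)
{
	smp_store_release(locked, 1);
	dsb_sev();			/* wake any wfe() waiter */
}

static inline void mcs_node_wait_generic(int *locked)
{
	smp_cond_load_acquire(locked, VAL);	/* generic implementation spins */
}

static inline void mcs_node_unlock_generic(int *locked)
{
	smp_store_release(locked, 1);
}

The generic pair drops the leading smp_mb() and the wfe()/dsb_sev() event
pairing, which is what makes the open-coded replacement not strictly
equivalent on arm32 (moot for qspinlock, as noted).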
2022-07-05 16:57 ` Peter Zijlstra
@ 2022-07-12 0:06 ` Nicholas Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-12 0:06 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Boqun Feng, linux-kernel, Waiman Long, Ingo Molnar, Will Deacon
Excerpts from Peter Zijlstra's message of July 6, 2022 2:57 am:
> On Tue, Jul 05, 2022 at 12:38:09AM +1000, Nicholas Piggin wrote:
>> qspinlock uses mcs_spinlock for the struct type (.next, .locked, and the
>> misplaced .count), and arch_mcs_spin_{un}lock_contended(). These can be
>> trivially inlined into qspinlock, and the unused mcs_spinlock code
>> removed.
>>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>
>> arch/arm/include/asm/mcs_spinlock.h | 24 ------
>
>> --- a/arch/arm/include/asm/mcs_spinlock.h
>> +++ /dev/null
>> @@ -1,24 +0,0 @@
>> -/* SPDX-License-Identifier: GPL-2.0 */
>> -#ifndef __ASM_MCS_LOCK_H
>> -#define __ASM_MCS_LOCK_H
>> -
>> -#ifdef CONFIG_SMP
>> -#include <asm/spinlock.h>
>> -
>> -/* MCS spin-locking. */
>> -#define arch_mcs_spin_lock_contended(lock) \
>> -do { \
>> - /* Ensure prior stores are observed before we enter wfe. */ \
>> - smp_mb(); \
>> - while (!(smp_load_acquire(lock))) \
>> - wfe(); \
>> -} while (0) \
>> -
>> -#define arch_mcs_spin_unlock_contended(lock) \
>> -do { \
>> - smp_store_release(lock, 1); \
>> - dsb_sev(); \
>> -} while (0)
>> -
>> -#endif /* CONFIG_SMP */
>> -#endif /* __ASM_MCS_LOCK_H */
>
>> @@ -475,7 +476,8 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>> WRITE_ONCE(prev->next, node);
>>
>> pv_wait_node(node, prev);
>> - arch_mcs_spin_lock_contended(&node->locked);
>> + /* Wait for mcs node lock to be released */
>> + smp_cond_load_acquire(&node->locked, VAL);
>>
>> /*
>> * While waiting for the MCS lock, the next pointer may have
>> @@ -554,7 +556,7 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>> if (!next)
>> next = smp_cond_load_relaxed(&node->next, (VAL));
>>
>> - arch_mcs_spin_unlock_contended(&next->locked);
>> + smp_store_release(&next->locked, 1); /* unlock the mcs node lock */
>
> These are not equivalent. Now it so happens that ARM doesn't use
> qspinlock and the other mcs lock users are gone by now, but something
> like this should at least have been called out in the Changelog I think.
>
Yeah, I could mention it in the changelog; I just didn't because it is
practically equivalent.
Thanks,
Nick
* [PATCH 03/13] locking/qspinlock: split common mcs queueing code into its own function
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
2022-07-04 14:38 ` [PATCH 01/13] locking/qspinlock: remove pv_node abstraction Nicholas Piggin
2022-07-04 14:38 ` [PATCH 02/13] locking/qspinlock: inline mcs_spinlock functions into qspinlock Nicholas Piggin
@ 2022-07-04 14:38 ` Nicholas Piggin
2022-07-05 17:01 ` Peter Zijlstra
2022-07-04 14:38 ` [PATCH 04/13] locking/qspinlock: move pv lock word helpers into qspinlock.c Nicholas Piggin
` (10 subsequent siblings)
13 siblings, 1 reply; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Waiman Long,
Boqun Feng, linux-kernel
pv qspinlocks jump over a bunch of slowpath code directly to the
queueing part. Split the queueing code into its own function and call it
explicitly in the pv and !pv cases. This will help to untangle the two
cases with subsequent changes.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
kernel/locking/qspinlock.c | 208 +++++++++++++++++++------------------
1 file changed, 108 insertions(+), 100 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 32f401e966ab..7360d643de29 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -294,112 +294,14 @@ static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
#endif /* _GEN_PV_LOCK_SLOWPATH */
-/**
- * queued_spin_lock_slowpath - acquire the queued spinlock
- * @lock: Pointer to queued spinlock structure
- * @val: Current value of the queued spinlock 32-bit word
- *
- * (queue tail, pending bit, lock value)
- *
- * fast : slow : unlock
- * : :
- * uncontended (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
- * : | ^--------.------. / :
- * : v \ \ | :
- * pending : (0,1,1) +--> (0,1,0) \ | :
- * : | ^--' | | :
- * : v | | :
- * uncontended : (n,x,y) +--> (n,0,0) --' | :
- * queue : | ^--' | :
- * : v | :
- * contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
- * queue : ^--' :
- */
-void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
{
struct qnode *prev, *next, *node;
- u32 old, tail;
+ u32 val, old, tail;
int idx;
BUILD_BUG_ON(CONFIG_NR_CPUS >= (1U << _Q_TAIL_CPU_BITS));
- if (pv_enabled())
- goto pv_queue;
-
- if (virt_spin_lock(lock))
- return;
-
- /*
- * Wait for in-progress pending->locked hand-overs with a bounded
- * number of spins so that we guarantee forward progress.
- *
- * 0,1,0 -> 0,0,1
- */
- if (val == _Q_PENDING_VAL) {
- int cnt = _Q_PENDING_LOOPS;
- val = atomic_cond_read_relaxed(&lock->val,
- (VAL != _Q_PENDING_VAL) || !cnt--);
- }
-
- /*
- * If we observe any contention; queue.
- */
- if (val & ~_Q_LOCKED_MASK)
- goto queue;
-
- /*
- * trylock || pending
- *
- * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
- */
- val = queued_fetch_set_pending_acquire(lock);
-
- /*
- * If we observe contention, there is a concurrent locker.
- *
- * Undo and queue; our setting of PENDING might have made the
- * n,0,0 -> 0,0,0 transition fail and it will now be waiting
- * on @next to become !NULL.
- */
- if (unlikely(val & ~_Q_LOCKED_MASK)) {
-
- /* Undo PENDING if we set it. */
- if (!(val & _Q_PENDING_MASK))
- clear_pending(lock);
-
- goto queue;
- }
-
- /*
- * We're pending, wait for the owner to go away.
- *
- * 0,1,1 -> 0,1,0
- *
- * this wait loop must be a load-acquire such that we match the
- * store-release that clears the locked bit and create lock
- * sequentiality; this is because not all
- * clear_pending_set_locked() implementations imply full
- * barriers.
- */
- if (val & _Q_LOCKED_MASK)
- atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
-
- /*
- * take ownership and clear the pending bit.
- *
- * 0,1,0 -> 0,0,1
- */
- clear_pending_set_locked(lock);
- lockevent_inc(lock_pending);
- return;
-
- /*
- * End of pending bit optimistic spinning and beginning of MCS
- * queuing.
- */
-queue:
- lockevent_inc(lock_slowpath);
-pv_queue:
node = this_cpu_ptr(&qnodes[0]);
idx = node->count++;
tail = encode_tail(smp_processor_id(), idx);
@@ -567,6 +469,110 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
*/
__this_cpu_dec(qnodes[0].count);
}
+
+/**
+ * queued_spin_lock_slowpath - acquire the queued spinlock
+ * @lock: Pointer to queued spinlock structure
+ * @val: Current value of the queued spinlock 32-bit word
+ *
+ * (queue tail, pending bit, lock value)
+ *
+ * fast : slow : unlock
+ * : :
+ * uncontended (0,0,0) -:--> (0,0,1) ------------------------------:--> (*,*,0)
+ * : | ^--------.------. / :
+ * : v \ \ | :
+ * pending : (0,1,1) +--> (0,1,0) \ | :
+ * : | ^--' | | :
+ * : v | | :
+ * uncontended : (n,x,y) +--> (n,0,0) --' | :
+ * queue : | ^--' | :
+ * : v | :
+ * contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
+ * queue : ^--' :
+ */
+void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+ if (pv_enabled()) {
+ queued_spin_lock_mcs_queue(lock);
+ return;
+ }
+
+ if (virt_spin_lock(lock))
+ return;
+
+ /*
+ * Wait for in-progress pending->locked hand-overs with a bounded
+ * number of spins so that we guarantee forward progress.
+ *
+ * 0,1,0 -> 0,0,1
+ */
+ if (val == _Q_PENDING_VAL) {
+ int cnt = _Q_PENDING_LOOPS;
+ val = atomic_cond_read_relaxed(&lock->val,
+ (VAL != _Q_PENDING_VAL) || !cnt--);
+ }
+
+ /*
+ * If we observe any contention; queue.
+ */
+ if (val & ~_Q_LOCKED_MASK)
+ goto queue;
+
+ /*
+ * trylock || pending
+ *
+ * 0,0,* -> 0,1,* -> 0,0,1 pending, trylock
+ */
+ val = queued_fetch_set_pending_acquire(lock);
+
+ /*
+ * If we observe contention, there is a concurrent locker.
+ *
+ * Undo and queue; our setting of PENDING might have made the
+ * n,0,0 -> 0,0,0 transition fail and it will now be waiting
+ * on @next to become !NULL.
+ */
+ if (unlikely(val & ~_Q_LOCKED_MASK)) {
+
+ /* Undo PENDING if we set it. */
+ if (!(val & _Q_PENDING_MASK))
+ clear_pending(lock);
+
+ goto queue;
+ }
+
+ /*
+ * We're pending, wait for the owner to go away.
+ *
+ * 0,1,1 -> 0,1,0
+ *
+ * this wait loop must be a load-acquire such that we match the
+ * store-release that clears the locked bit and create lock
+ * sequentiality; this is because not all
+ * clear_pending_set_locked() implementations imply full
+ * barriers.
+ */
+ if (val & _Q_LOCKED_MASK)
+ atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_MASK));
+
+ /*
+ * take ownership and clear the pending bit.
+ *
+ * 0,1,0 -> 0,0,1
+ */
+ clear_pending_set_locked(lock);
+ lockevent_inc(lock_pending);
+ return;
+
+ /*
+ * End of pending bit optimistic spinning and beginning of MCS
+ * queuing.
+ */
+queue:
+ lockevent_inc(lock_slowpath);
+ queued_spin_lock_mcs_queue(lock);
+}
EXPORT_SYMBOL(queued_spin_lock_slowpath);
/*
@@ -583,6 +589,8 @@ EXPORT_SYMBOL(queued_spin_lock_slowpath);
#undef pv_kick_node
#undef pv_wait_head_or_lock
+#define queued_spin_lock_mcs_queue __pv_queued_spin_lock_mcs_queue
+
#undef queued_spin_lock_slowpath
#define queued_spin_lock_slowpath __pv_queued_spin_lock_slowpath
--
2.35.1
* Re: [PATCH 03/13] locking/qspinlock: split common mcs queueing code into its own function
2022-07-04 14:38 ` [PATCH 03/13] locking/qspinlock: split common mcs queueing code into its own function Nicholas Piggin
@ 2022-07-05 17:01 ` Peter Zijlstra
2022-07-12 0:10 ` Nicholas Piggin
0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2022-07-05 17:01 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, linux-kernel
On Tue, Jul 05, 2022 at 12:38:10AM +1000, Nicholas Piggin wrote:
> +void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> +{
> + if (pv_enabled()) {
> + queued_spin_lock_mcs_queue(lock);
> + return;
> + }
> +
> + if (virt_spin_lock(lock))
> + return;
> +
This reminds me; at the time I meant to make queued_spin_lock_slowpath()
a static_call() and redirect the function appropriately at boot time.
But that was before static_call() was merged and I never seem to have
gotten around to doing that afterwards...
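For reference, a rough sketch of that direction using the static_call()
API; the key name and the init hook below are invented for illustration
(assuming kernel context, with native_queued_spin_lock_slowpath() as the
default target), not code from this series:

#include <linux/static_call.h>

DEFINE_STATIC_CALL(qspinlock_slowpath, native_queued_spin_lock_slowpath);

void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
	static_call(qspinlock_slowpath)(lock, val);
}

/* boot-time selection, e.g. when paravirt spinlocks are detected: */
void __init qspinlock_select_pv(void)
{
	static_call_update(qspinlock_slowpath, __pv_queued_spin_lock_slowpath);
}

With static_call() the dispatch is patched into a direct call at boot,
rather than being decided by a branch or an indirect call on every
slowpath entry.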
2022-07-05 17:01 ` Peter Zijlstra
@ 2022-07-12 0:10 ` Nicholas Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-12 0:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Boqun Feng, linux-kernel, Waiman Long, Ingo Molnar, Will Deacon
Excerpts from Peter Zijlstra's message of July 6, 2022 3:01 am:
> On Tue, Jul 05, 2022 at 12:38:10AM +1000, Nicholas Piggin wrote:
>> +void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
>> +{
>> + if (pv_enabled()) {
>> + queued_spin_lock_mcs_queue(lock);
>> + return;
>> + }
>> +
>> + if (virt_spin_lock(lock))
>> + return;
>> +
>
> This reminds me; at the time I meant to make queued_spin_lock_slowpath()
> a static_call() and redirect the function appropriately at boot time.
> But that was before static_call() was merged and I never seem to have
> gotten around to doing that afterwards...
Wouldn't hurt. OTOH, hyper-optimising the contended path is probably
barely measurable. Optimising coherency in the contended path,
absolutely, but straight-line code less so. That said, don't let me
stop you :)
Thanks,
Nick
* [PATCH 04/13] locking/qspinlock: move pv lock word helpers into qspinlock.c
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
` (2 preceding siblings ...)
2022-07-04 14:38 ` [PATCH 03/13] locking/qspinlock: split common mcs queueing code into its own function Nicholas Piggin
@ 2022-07-04 14:38 ` Nicholas Piggin
2022-07-05 19:34 ` Waiman Long
2022-07-04 14:38 ` [PATCH 05/13] locking/qspinlock: be less clever with the preprocessor Nicholas Piggin
` (9 subsequent siblings)
13 siblings, 1 reply; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Waiman Long,
Boqun Feng, linux-kernel
There is no real reason not to keep all the bit manipulation together.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
kernel/locking/qspinlock.c | 107 ++++++++++++++++------------
kernel/locking/qspinlock_paravirt.h | 51 -------------
2 files changed, 63 insertions(+), 95 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 7360d643de29..8f2173e22479 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -141,7 +141,24 @@ struct qnode *grab_qnode(struct qnode *base, int idx)
#define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
+/**
+ * set_pending - set the pending bit.
+ * @lock: Pointer to queued spinlock structure
+ *
+ * *,0,* -> *,1,*
+ *
+ * The pending bit is used by the queue head vCPU to indicate that it
+ * is actively spinning on the lock and no lock stealing is allowed.
+ */
+static __always_inline void set_pending(struct qspinlock *lock)
+{
#if _Q_PENDING_BITS == 8
+ WRITE_ONCE(lock->pending, 1);
+#else
+ atomic_or(_Q_PENDING_VAL, &lock->val);
+#endif
+}
+
/**
* clear_pending - clear the pending bit.
* @lock: Pointer to queued spinlock structure
@@ -150,7 +167,11 @@ struct qnode *grab_qnode(struct qnode *base, int idx)
*/
static __always_inline void clear_pending(struct qspinlock *lock)
{
+#if _Q_PENDING_BITS == 8
WRITE_ONCE(lock->pending, 0);
+#else
+ atomic_andnot(_Q_PENDING_VAL, &lock->val);
+#endif
}
/**
@@ -163,74 +184,72 @@ static __always_inline void clear_pending(struct qspinlock *lock)
*/
static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
{
+#if _Q_PENDING_BITS == 8
WRITE_ONCE(lock->locked_pending, _Q_LOCKED_VAL);
+#else
+ atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
+#endif
}
-/*
- * xchg_tail - Put in the new queue tail code word & retrieve previous one
- * @lock : Pointer to queued spinlock structure
- * @tail : The new queue tail code word
- * Return: The previous queue tail code word
- *
- * xchg(lock, tail), which heads an address dependency
- *
- * p,*,* -> n,*,* ; prev = xchg(lock, node)
- */
-static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
-{
- /*
- * We can use relaxed semantics since the caller ensures that the
- * MCS node is properly initialized before updating the tail.
- */
- return (u32)xchg_relaxed(&lock->tail,
- tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
-}
-
-#else /* _Q_PENDING_BITS == 8 */
-
/**
- * clear_pending - clear the pending bit.
+ * trylock_clear_pending - try to take ownership and clear the pending bit
* @lock: Pointer to queued spinlock structure
*
- * *,1,* -> *,0,*
+ * 0,1,0 -> 0,0,1
*/
-static __always_inline void clear_pending(struct qspinlock *lock)
+static __always_inline int trylock_clear_pending(struct qspinlock *lock)
{
- atomic_andnot(_Q_PENDING_VAL, &lock->val);
-}
+#if _Q_PENDING_BITS == 8
+ return !READ_ONCE(lock->locked) &&
+ (cmpxchg_acquire(&lock->locked_pending, _Q_PENDING_VAL,
+ _Q_LOCKED_VAL) == _Q_PENDING_VAL);
+#else
+ int val = atomic_read(&lock->val);
-/**
- * clear_pending_set_locked - take ownership and clear the pending bit.
- * @lock: Pointer to queued spinlock structure
- *
- * *,1,0 -> *,0,1
- */
-static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
-{
- atomic_add(-_Q_PENDING_VAL + _Q_LOCKED_VAL, &lock->val);
+ for (;;) {
+ int old, new;
+
+ if (val & _Q_LOCKED_MASK)
+ break;
+
+ /*
+ * Try to clear pending bit & set locked bit
+ */
+ old = val;
+ new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
+ val = atomic_cmpxchg_acquire(&lock->val, old, new);
+
+ if (val == old)
+ return 1;
+ }
+ return 0;
+#endif
}
-/**
+/*
* xchg_tail - Put in the new queue tail code word & retrieve previous one
* @lock : Pointer to queued spinlock structure
* @tail : The new queue tail code word
* Return: The previous queue tail code word
*
- * xchg(lock, tail)
+ * xchg(lock, tail), which heads an address dependency
*
* p,*,* -> n,*,* ; prev = xchg(lock, node)
*/
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
+ /*
+ * We can use relaxed semantics since the caller ensures that the
+ * MCS node is properly initialized before updating the tail.
+ */
+#if _Q_PENDING_BITS == 8
+ return (u32)xchg_relaxed(&lock->tail,
+ tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
+#else
u32 old, new, val = atomic_read(&lock->val);
for (;;) {
new = (val & _Q_LOCKED_PENDING_MASK) | tail;
- /*
- * We can use relaxed semantics since the caller ensures that
- * the MCS node is properly initialized before updating the
- * tail.
- */
old = atomic_cmpxchg_relaxed(&lock->val, val, new);
if (old == val)
break;
@@ -238,8 +257,8 @@ static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
val = old;
}
return old;
+#endif
}
-#endif /* _Q_PENDING_BITS == 8 */
/**
* queued_fetch_set_pending_acquire - fetch the whole lock value and set pending
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index cce3d3dde216..97385861adc2 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -95,57 +95,6 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
return false;
}
-/*
- * The pending bit is used by the queue head vCPU to indicate that it
- * is actively spinning on the lock and no lock stealing is allowed.
- */
-#if _Q_PENDING_BITS == 8
-static __always_inline void set_pending(struct qspinlock *lock)
-{
- WRITE_ONCE(lock->pending, 1);
-}
-
-/*
- * The pending bit check in pv_queued_spin_steal_lock() isn't a memory
- * barrier. Therefore, an atomic cmpxchg_acquire() is used to acquire the
- * lock just to be sure that it will get it.
- */
-static __always_inline int trylock_clear_pending(struct qspinlock *lock)
-{
- return !READ_ONCE(lock->locked) &&
- (cmpxchg_acquire(&lock->locked_pending, _Q_PENDING_VAL,
- _Q_LOCKED_VAL) == _Q_PENDING_VAL);
-}
-#else /* _Q_PENDING_BITS == 8 */
-static __always_inline void set_pending(struct qspinlock *lock)
-{
- atomic_or(_Q_PENDING_VAL, &lock->val);
-}
-
-static __always_inline int trylock_clear_pending(struct qspinlock *lock)
-{
- int val = atomic_read(&lock->val);
-
- for (;;) {
- int old, new;
-
- if (val & _Q_LOCKED_MASK)
- break;
-
- /*
- * Try to clear pending bit & set locked bit
- */
- old = val;
- new = (val & ~_Q_PENDING_MASK) | _Q_LOCKED_VAL;
- val = atomic_cmpxchg_acquire(&lock->val, old, new);
-
- if (val == old)
- return 1;
- }
- return 0;
-}
-#endif /* _Q_PENDING_BITS == 8 */
-
/*
* Lock and MCS node addresses hash table for fast lookup
*
--
2.35.1
* Re: [PATCH 04/13] locking/qspinlock: move pv lock word helpers into qspinlock.c
2022-07-04 14:38 ` [PATCH 04/13] locking/qspinlock: move pv lock word helpers into qspinlock.c Nicholas Piggin
@ 2022-07-05 19:34 ` Waiman Long
2022-07-12 0:11 ` Nicholas Piggin
0 siblings, 1 reply; 38+ messages in thread
From: Waiman Long @ 2022-07-05 19:34 UTC (permalink / raw)
To: Nicholas Piggin, Peter Zijlstra
Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel
On 7/4/22 10:38, Nicholas Piggin wrote:
> There is no real reason not to keep all the bit manipulation together.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> kernel/locking/qspinlock.c | 107 ++++++++++++++++------------
> kernel/locking/qspinlock_paravirt.h | 51 -------------
> 2 files changed, 63 insertions(+), 95 deletions(-)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 7360d643de29..8f2173e22479 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -141,7 +141,24 @@ struct qnode *grab_qnode(struct qnode *base, int idx)
>
> #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
>
> +/**
> + * set_pending - set the pending bit.
> + * @lock: Pointer to queued spinlock structure
> + *
> + * *,0,* -> *,1,*
> + *
> + * The pending bit is used by the queue head vCPU to indicate that it
> + * is actively spinning on the lock and no lock stealing is allowed.
The pending bit has different usage in the pv and non-pv cases. The
description here refers to the pv case. For non-pv, it is used to avoid
loading the extra node cacheline in the likely contended case.
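For context, the non-pv fast path in question looks roughly like this (a
paraphrased, simplified sketch of the existing native slowpath, not a
verbatim quote): a single contender sets pending and spins on the lock
word itself, so the per-cpu MCS node cacheline is never touched.
	val = queued_fetch_set_pending_acquire(lock);	/* 0,0,* -> 0,1,* */
	if (unlikely(val & ~_Q_LOCKED_MASK)) {
		if (!(val & _Q_PENDING_MASK))
			clear_pending(lock);	/* we must queue after all */
		goto queue;
	}
	if (val & _Q_LOCKED_MASK)	/* wait for the owner to go away */
		smp_cond_load_acquire(&lock->locked, !VAL);
	clear_pending_set_locked(lock);	/* 0,1,0 -> 0,0,1 */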
Cheers,
Longman
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 04/13] locking/qspinlock: move pv lock word helpers into qspinlock.c
2022-07-05 19:34 ` Waiman Long
@ 2022-07-12 0:11 ` Nicholas Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-12 0:11 UTC (permalink / raw)
To: Waiman Long, Peter Zijlstra
Cc: Boqun Feng, linux-kernel, Ingo Molnar, Will Deacon
Excerpts from Waiman Long's message of July 6, 2022 5:34 am:
> On 7/4/22 10:38, Nicholas Piggin wrote:
>> There is no real reason not to keep all the bit manipulation together.
>>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> kernel/locking/qspinlock.c | 107 ++++++++++++++++------------
>> kernel/locking/qspinlock_paravirt.h | 51 -------------
>> 2 files changed, 63 insertions(+), 95 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index 7360d643de29..8f2173e22479 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -141,7 +141,24 @@ struct qnode *grab_qnode(struct qnode *base, int idx)
>>
>> #define _Q_LOCKED_PENDING_MASK (_Q_LOCKED_MASK | _Q_PENDING_MASK)
>>
>> +/**
>> + * set_pending - set the pending bit.
>> + * @lock: Pointer to queued spinlock structure
>> + *
>> + * *,0,* -> *,1,*
>> + *
>> + * The pending bit is used by the queue head vCPU to indicate that it
>> + * is actively spinning on the lock and no lock stealing is allowed.
>
> The pending bit has different usage in the pv and non-pv cases. The
> description here refers to the pv case. For non-pv, it is used to avoid
> loading the extra node cacheline in the likely contended case.
Ah thank you, I can expand on that.
Thanks,
Nick
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 05/13] locking/qspinlock: be less clever with the preprocessor
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
` (3 preceding siblings ...)
2022-07-04 14:38 ` [PATCH 04/13] locking/qspinlock: move pv lock word helpers into qspinlock.c Nicholas Piggin
@ 2022-07-04 14:38 ` Nicholas Piggin
2022-07-05 17:08 ` Peter Zijlstra
2022-07-05 20:02 ` Waiman Long
2022-07-04 14:38 ` [PATCH 06/13] locking/qspinlock: merge qspinlock_paravirt.h into qspinlock.c Nicholas Piggin
` (8 subsequent siblings)
13 siblings, 2 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Waiman Long,
Boqun Feng, linux-kernel
Stop qspinlock.c including itself and avoid most of the function
renaming with the preprocessor.
This is mostly done by having the common slowpath code take a 'bool
paravirt' argument and adjusting code based on that.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
kernel/locking/qspinlock.c | 116 ++++++++++++----------------
kernel/locking/qspinlock_paravirt.h | 10 +--
2 files changed, 52 insertions(+), 74 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 8f2173e22479..b96c58ca51de 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -11,8 +11,6 @@
* Peter Zijlstra <peterz@infradead.org>
*/
-#ifndef _GEN_PV_LOCK_SLOWPATH
-
#include <linux/smp.h>
#include <linux/bug.h>
#include <linux/cpumask.h>
@@ -285,35 +283,21 @@ static __always_inline void set_locked(struct qspinlock *lock)
WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
}
-
-/*
- * Generate the native code for queued_spin_unlock_slowpath(); provide NOPs for
- * all the PV callbacks.
- */
-
-static __always_inline void __pv_init_node(struct qnode *node) { }
-static __always_inline void __pv_wait_node(struct qnode *node,
- struct qnode *prev) { }
-static __always_inline void __pv_kick_node(struct qspinlock *lock,
- struct qnode *node) { }
-static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
- struct qnode *node)
- { return 0; }
-
-#define pv_enabled() false
-
-#define pv_init_node __pv_init_node
-#define pv_wait_node __pv_wait_node
-#define pv_kick_node __pv_kick_node
-#define pv_wait_head_or_lock __pv_wait_head_or_lock
-
#ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
-#endif
-
-#endif /* _GEN_PV_LOCK_SLOWPATH */
+#include "qspinlock_paravirt.h"
+#else /* CONFIG_PARAVIRT_SPINLOCKS */
+static __always_inline void pv_init_node(struct qnode *node) { }
+static __always_inline void pv_wait_node(struct qnode *node,
+ struct qnode *prev) { }
+static __always_inline void pv_kick_node(struct qspinlock *lock,
+ struct qnode *node) { }
+static __always_inline u32 pv_wait_head_or_lock(struct qspinlock *lock,
+ struct qnode *node)
+ { return 0; }
+static __always_inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock) { BUILD_BUG(); }
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
-static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
+static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool paravirt)
{
struct qnode *prev, *next, *node;
u32 val, old, tail;
@@ -338,8 +322,13 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
*/
if (unlikely(idx >= MAX_NODES)) {
lockevent_inc(lock_no_node);
- while (!queued_spin_trylock(lock))
- cpu_relax();
+ if (paravirt) {
+ while (!pv_hybrid_queued_unfair_trylock(lock))
+ cpu_relax();
+ } else {
+ while (!queued_spin_trylock(lock))
+ cpu_relax();
+ }
goto release;
}
@@ -359,15 +348,21 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
node->locked = 0;
node->next = NULL;
- pv_init_node(node);
+ if (paravirt)
+ pv_init_node(node);
/*
* We touched a (possibly) cold cacheline in the per-cpu queue node;
* attempt the trylock once more in the hope someone let go while we
* weren't watching.
*/
- if (queued_spin_trylock(lock))
- goto release;
+ if (paravirt) {
+ if (pv_hybrid_queued_unfair_trylock(lock))
+ goto release;
+ } else {
+ if (queued_spin_trylock(lock))
+ goto release;
+ }
/*
* Ensure that the initialisation of @node is complete before we
@@ -396,7 +391,8 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
/* Link @node into the waitqueue. */
WRITE_ONCE(prev->next, node);
- pv_wait_node(node, prev);
+ if (paravirt)
+ pv_wait_node(node, prev);
/* Wait for mcs node lock to be released */
smp_cond_load_acquire(&node->locked, VAL);
@@ -432,8 +428,10 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
* If PV isn't active, 0 will be returned instead.
*
*/
- if ((val = pv_wait_head_or_lock(lock, node)))
- goto locked;
+ if (paravirt) {
+ if ((val = pv_wait_head_or_lock(lock, node)))
+ goto locked;
+ }
val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
@@ -478,7 +476,8 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
next = smp_cond_load_relaxed(&node->next, (VAL));
smp_store_release(&next->locked, 1); /* unlock the mcs node lock */
- pv_kick_node(lock, next);
+ if (paravirt)
+ pv_kick_node(lock, next);
release:
trace_contention_end(lock, 0);
@@ -510,13 +509,12 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
* contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
* queue : ^--' :
*/
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
+#endif
+
void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
- if (pv_enabled()) {
- queued_spin_lock_mcs_queue(lock);
- return;
- }
-
if (virt_spin_lock(lock))
return;
@@ -590,31 +588,17 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
*/
queue:
lockevent_inc(lock_slowpath);
- queued_spin_lock_mcs_queue(lock);
+ queued_spin_lock_mcs_queue(lock, false);
}
EXPORT_SYMBOL(queued_spin_lock_slowpath);
-/*
- * Generate the paravirt code for queued_spin_unlock_slowpath().
- */
-#if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
-#define _GEN_PV_LOCK_SLOWPATH
-
-#undef pv_enabled
-#define pv_enabled() true
-
-#undef pv_init_node
-#undef pv_wait_node
-#undef pv_kick_node
-#undef pv_wait_head_or_lock
-
-#define queued_spin_lock_mcs_queue __pv_queued_spin_lock_mcs_queue
-
-#undef queued_spin_lock_slowpath
-#define queued_spin_lock_slowpath __pv_queued_spin_lock_slowpath
-
-#include "qspinlock_paravirt.h"
-#include "qspinlock.c"
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#undef queued_spin_lock_slowpath
+void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+{
+ queued_spin_lock_mcs_queue(lock, true);
+}
+EXPORT_SYMBOL(__pv_queued_spin_lock_slowpath);
bool nopvspin __initdata;
static __init int parse_nopvspin(char *arg)
@@ -623,4 +607,4 @@ static __init int parse_nopvspin(char *arg)
return 0;
}
early_param("nopvspin", parse_nopvspin);
-#endif
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
index 97385861adc2..f1922e3a0f7d 100644
--- a/kernel/locking/qspinlock_paravirt.h
+++ b/kernel/locking/qspinlock_paravirt.h
@@ -1,8 +1,4 @@
/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _GEN_PV_LOCK_SLOWPATH
-#error "do not include this file"
-#endif
-
#include <linux/hash.h>
#include <linux/memblock.h>
#include <linux/debug_locks.h>
@@ -50,9 +46,8 @@ enum vcpu_state {
/*
* Hybrid PV queued/unfair lock
*
- * By replacing the regular queued_spin_trylock() with the function below,
- * it will be called once when a lock waiter enter the PV slowpath before
- * being queued.
+ * This function is called once when a lock waiter enters the PV slowpath
+ * before being queued.
*
* The pending bit is set by the queue head vCPU of the MCS wait queue in
* pv_wait_head_or_lock() to signal that it is ready to spin on the lock.
@@ -71,7 +66,6 @@ enum vcpu_state {
* queued lock (no lock starvation) and an unfair lock (good performance
* on not heavily contended locks).
*/
-#define queued_spin_trylock(l) pv_hybrid_queued_unfair_trylock(l)
static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
{
/*
--
2.35.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 05/13] locking/qspinlock: be less clever with the preprocessor
2022-07-04 14:38 ` [PATCH 05/13] locking/qspinlock: be less clever with the preprocessor Nicholas Piggin
@ 2022-07-05 17:08 ` Peter Zijlstra
2022-07-12 0:29 ` Nicholas Piggin
2022-07-05 20:02 ` Waiman Long
1 sibling, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2022-07-05 17:08 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, linux-kernel
On Tue, Jul 05, 2022 at 12:38:12AM +1000, Nicholas Piggin wrote:
> Stop qspinlock.c including itself and avoid most of the function
> renaming with the preprocessor.
>
> This is mostly done by having the common slowpath code take a 'bool
> paravirt' argument and adjusting code based on that.
What does code-gen do? Is it clever enough to constant fold the lot?
Should we be using __always_inline to ensure the compiler doesn't decide
against inlining because $raisin and then emit extra conditionals?
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 05/13] locking/qspinlock: be less clever with the preprocessor
2022-07-05 17:08 ` Peter Zijlstra
@ 2022-07-12 0:29 ` Nicholas Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-12 0:29 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Boqun Feng, linux-kernel, Waiman Long, Ingo Molnar, Will Deacon
Excerpts from Peter Zijlstra's message of July 6, 2022 3:08 am:
> On Tue, Jul 05, 2022 at 12:38:12AM +1000, Nicholas Piggin wrote:
>> Stop qspinlock.c including itself and avoid most of the function
>> renaming with the preprocessor.
>>
>> This is mostly done by having the common slowpath code take a 'bool
>> paravirt' argument and adjusting code based on that.
>
> What does code-gen do? Is it clever enough to constant fold the lot?
>
> Should we be using __always_inline to ensure the compiler doesn't decide
> against inlining because $raisin and then emit extra conditionals?
>
It seems to fold it. Code is just different enough to make it hard
to follow what the asm differences are exactly but doesn't seem to
pass 'paravirt' anywhere.
Yes it does need __always_inline certainly. Only one path will ever
be used at runtime so any icache sharing decision by the compiler
would be wrong (and we don't care about image size very much).
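For anyone following along, the shape of what is being folded, as a tiny
standalone sketch (names here are made up for illustration, they are not
the kernel functions):
	#include <stdbool.h>
	static void pv_path(void)     { /* pv-only work */ }
	static void native_path(void) { /* native-only work */ }
	static inline __attribute__((__always_inline__)) void
	common_slowpath(bool paravirt)
	{
		if (paravirt)
			pv_path();	/* the only branch left in pv_slowpath() */
		else
			native_path();	/* the only branch left in native_slowpath() */
	}
	void native_slowpath(void) { common_slowpath(false); }
	void pv_slowpath(void)     { common_slowpath(true); }
Each caller passes a compile-time constant, so once the helper is forced
inline the branch is resolved at build time and no 'paravirt' value exists
at runtime; without the forced inlining the compiler could in principle
keep the helper out of line and test the argument.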
Thanks,
Nick
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 05/13] locking/qspinlock: be less clever with the preprocessor
2022-07-04 14:38 ` [PATCH 05/13] locking/qspinlock: be less clever with the preprocessor Nicholas Piggin
2022-07-05 17:08 ` Peter Zijlstra
@ 2022-07-05 20:02 ` Waiman Long
2022-07-12 0:33 ` Nicholas Piggin
1 sibling, 1 reply; 38+ messages in thread
From: Waiman Long @ 2022-07-05 20:02 UTC (permalink / raw)
To: Nicholas Piggin, Peter Zijlstra
Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel
On 7/4/22 10:38, Nicholas Piggin wrote:
> Stop qspinlock.c including itself and avoid most of the function
> renaming with the preprocessor.
>
> This is mostly done by having the common slowpath code take a 'bool
> paravirt' argument and adjusting code based on that.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> kernel/locking/qspinlock.c | 116 ++++++++++++----------------
> kernel/locking/qspinlock_paravirt.h | 10 +--
> 2 files changed, 52 insertions(+), 74 deletions(-)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 8f2173e22479..b96c58ca51de 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -11,8 +11,6 @@
> * Peter Zijlstra <peterz@infradead.org>
> */
>
> -#ifndef _GEN_PV_LOCK_SLOWPATH
> -
> #include <linux/smp.h>
> #include <linux/bug.h>
> #include <linux/cpumask.h>
> @@ -285,35 +283,21 @@ static __always_inline void set_locked(struct qspinlock *lock)
> WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
> }
>
> -
> -/*
> - * Generate the native code for queued_spin_unlock_slowpath(); provide NOPs for
> - * all the PV callbacks.
> - */
> -
> -static __always_inline void __pv_init_node(struct qnode *node) { }
> -static __always_inline void __pv_wait_node(struct qnode *node,
> - struct qnode *prev) { }
> -static __always_inline void __pv_kick_node(struct qspinlock *lock,
> - struct qnode *node) { }
> -static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
> - struct qnode *node)
> - { return 0; }
> -
> -#define pv_enabled() false
> -
> -#define pv_init_node __pv_init_node
> -#define pv_wait_node __pv_wait_node
> -#define pv_kick_node __pv_kick_node
> -#define pv_wait_head_or_lock __pv_wait_head_or_lock
> -
> #ifdef CONFIG_PARAVIRT_SPINLOCKS
> -#define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
> -#endif
> -
> -#endif /* _GEN_PV_LOCK_SLOWPATH */
> +#include "qspinlock_paravirt.h"
> +#else /* CONFIG_PARAVIRT_SPINLOCKS */
> +static __always_inline void pv_init_node(struct qnode *node) { }
> +static __always_inline void pv_wait_node(struct qnode *node,
> + struct qnode *prev) { }
> +static __always_inline void pv_kick_node(struct qspinlock *lock,
> + struct qnode *node) { }
> +static __always_inline u32 pv_wait_head_or_lock(struct qspinlock *lock,
> + struct qnode *node)
> + { return 0; }
> +static __always_inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock) { BUILD_BUG(); }
> +#endif /* CONFIG_PARAVIRT_SPINLOCKS */
>
> -static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> +static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool paravirt)
Using "const bool paravirt" may help the compiler generate better code
by eliminating the dead branch, if it is not doing that already.
> {
> struct qnode *prev, *next, *node;
> u32 val, old, tail;
> @@ -338,8 +322,13 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
> */
> if (unlikely(idx >= MAX_NODES)) {
> lockevent_inc(lock_no_node);
> - while (!queued_spin_trylock(lock))
> - cpu_relax();
> + if (paravirt) {
> + while (!pv_hybrid_queued_unfair_trylock(lock))
> + cpu_relax();
> + } else {
> + while (!queued_spin_trylock(lock))
> + cpu_relax();
> + }
The code will look a bit better if you add the following helper function
and use it instead.
static inline bool queued_spin_trylock_common(struct qspinlock *lock,
const bool paravirt)
{
if (paravirt)
return pv_hybrid_queued_unfair_trylock(lock);
else
return queued_spin_trylock(lock);
}
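With a helper along those lines, each of the duplicated spin-wait sites
would collapse to something like (sketch only):
	while (!queued_spin_trylock_common(lock, paravirt))
		cpu_relax();
keeping the pv/native choice in a single place.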
Cheers,
Longman
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 05/13] locking/qspinlock: be less clever with the preprocessor
2022-07-05 20:02 ` Waiman Long
@ 2022-07-12 0:33 ` Nicholas Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-12 0:33 UTC (permalink / raw)
To: Waiman Long, Peter Zijlstra
Cc: Boqun Feng, linux-kernel, Ingo Molnar, Will Deacon
Excerpts from Waiman Long's message of July 6, 2022 6:02 am:
> On 7/4/22 10:38, Nicholas Piggin wrote:
>> Stop qspinlock.c including itself and avoid most of the function
>> renaming with the preprocessor.
>>
>> This is mostly done by having the common slowpath code take a 'bool
>> paravirt' argument and adjusting code based on that.
>>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> kernel/locking/qspinlock.c | 116 ++++++++++++----------------
>> kernel/locking/qspinlock_paravirt.h | 10 +--
>> 2 files changed, 52 insertions(+), 74 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index 8f2173e22479..b96c58ca51de 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -11,8 +11,6 @@
>> * Peter Zijlstra <peterz@infradead.org>
>> */
>>
>> -#ifndef _GEN_PV_LOCK_SLOWPATH
>> -
>> #include <linux/smp.h>
>> #include <linux/bug.h>
>> #include <linux/cpumask.h>
>> @@ -285,35 +283,21 @@ static __always_inline void set_locked(struct qspinlock *lock)
>> WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
>> }
>>
>> -
>> -/*
>> - * Generate the native code for queued_spin_unlock_slowpath(); provide NOPs for
>> - * all the PV callbacks.
>> - */
>> -
>> -static __always_inline void __pv_init_node(struct qnode *node) { }
>> -static __always_inline void __pv_wait_node(struct qnode *node,
>> - struct qnode *prev) { }
>> -static __always_inline void __pv_kick_node(struct qspinlock *lock,
>> - struct qnode *node) { }
>> -static __always_inline u32 __pv_wait_head_or_lock(struct qspinlock *lock,
>> - struct qnode *node)
>> - { return 0; }
>> -
>> -#define pv_enabled() false
>> -
>> -#define pv_init_node __pv_init_node
>> -#define pv_wait_node __pv_wait_node
>> -#define pv_kick_node __pv_kick_node
>> -#define pv_wait_head_or_lock __pv_wait_head_or_lock
>> -
>> #ifdef CONFIG_PARAVIRT_SPINLOCKS
>> -#define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
>> -#endif
>> -
>> -#endif /* _GEN_PV_LOCK_SLOWPATH */
>> +#include "qspinlock_paravirt.h"
>> +#else /* CONFIG_PARAVIRT_SPINLOCKS */
>> +static __always_inline void pv_init_node(struct qnode *node) { }
>> +static __always_inline void pv_wait_node(struct qnode *node,
>> + struct qnode *prev) { }
>> +static __always_inline void pv_kick_node(struct qspinlock *lock,
>> + struct qnode *node) { }
>> +static __always_inline u32 pv_wait_head_or_lock(struct qspinlock *lock,
>> + struct qnode *node)
>> + { return 0; }
>> +static __always_inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock) { BUILD_BUG(); }
>> +#endif /* CONFIG_PARAVIRT_SPINLOCKS */
>>
>> -static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>> +static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool paravirt)
>
> Using "const bool paravirt" may help the compiler generate better code
> by eliminating the dead branch, if it is not doing that already.
I think adding always_inline per Peter's suggestion should be enough.
The compiler should be able to see it's constant. Actually I'm surprised
attribute pure is helpful within a compilation unit, I would think the
compiler should be able to deduce that too.
That said, no problem with massaging things to help the compiler DTRT.
>
>> {
>> struct qnode *prev, *next, *node;
>> u32 val, old, tail;
>> @@ -338,8 +322,13 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock)
>> */
>> if (unlikely(idx >= MAX_NODES)) {
>> lockevent_inc(lock_no_node);
>> - while (!queued_spin_trylock(lock))
>> - cpu_relax();
>> + if (paravirt) {
>> + while (!pv_hybrid_queued_unfair_trylock(lock))
>> + cpu_relax();
>> + } else {
>> + while (!queued_spin_trylock(lock))
>> + cpu_relax();
>> + }
>
> The code will look a bit better if you add the following helper function
> and use it instead.
>
> static inline bool queued_spin_trylock_common(struct qspinlock *lock,
> const bool paravirt)
> {
> if (paravirt)
> return pv_hybrid_queued_unfair_trylock(lock);
> else
> return queued_spin_trylock(lock);
> }
I did change that later on actually to always just use the basic
trylock. This was trying to be a more mechanical conversion that didn't
change too much.
Thanks,
Nick
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 06/13] locking/qspinlock: merge qspinlock_paravirt.h into qspinlock.c
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
` (4 preceding siblings ...)
2022-07-04 14:38 ` [PATCH 05/13] locking/qspinlock: be less clever with the preprocessor Nicholas Piggin
@ 2022-07-04 14:38 ` Nicholas Piggin
2022-07-05 17:20 ` Peter Zijlstra
2022-07-04 14:38 ` [PATCH 07/13] locking/qspinlock: remove arch qspinlock_paravirt.h includes Nicholas Piggin
` (7 subsequent siblings)
13 siblings, 1 reply; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Waiman Long,
Boqun Feng, linux-kernel
There isn't much reason to keep these separate.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
kernel/locking/qspinlock.c | 488 ++++++++++++++++++++++++++-
kernel/locking/qspinlock_paravirt.h | 490 ----------------------------
2 files changed, 487 insertions(+), 491 deletions(-)
delete mode 100644 kernel/locking/qspinlock_paravirt.h
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index b96c58ca51de..9a235b0d98ca 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -16,6 +16,7 @@
#include <linux/cpumask.h>
#include <linux/percpu.h>
#include <linux/hardirq.h>
+#include <linux/memblock.h>
#include <linux/mutex.h>
#include <linux/prefetch.h>
#include <asm/byteorder.h>
@@ -284,7 +285,492 @@ static __always_inline void set_locked(struct qspinlock *lock)
}
#ifdef CONFIG_PARAVIRT_SPINLOCKS
-#include "qspinlock_paravirt.h"
+/*
+ * Implement paravirt qspinlocks; the general idea is to halt the vcpus instead
+ * of spinning them.
+ *
+ * This relies on the architecture to provide two paravirt hypercalls:
+ *
+ * pv_wait(u8 *ptr, u8 val) -- suspends the vcpu if *ptr == val
+ * pv_kick(cpu) -- wakes a suspended vcpu
+ *
+ * Using these we implement __pv_queued_spin_lock_slowpath() and
+ * __pv_queued_spin_unlock() to replace native_queued_spin_lock_slowpath() and
+ * native_queued_spin_unlock().
+ */
+
+#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)
+
+/*
+ * Queue Node Adaptive Spinning
+ *
+ * A queue node vCPU will stop spinning if the vCPU in the previous node is
+ * not running. The one lock stealing attempt allowed at slowpath entry
+ * mitigates the slight slowdown for non-overcommitted guest with this
+ * aggressive wait-early mechanism.
+ *
+ * The status of the previous node will be checked at fixed interval
+ * controlled by PV_PREV_CHECK_MASK. This is to ensure that we won't
+ * pound on the cacheline of the previous node too heavily.
+ */
+#define PV_PREV_CHECK_MASK 0xff
+
+/*
+ * Queue node uses: vcpu_running & vcpu_halted.
+ * Queue head uses: vcpu_running & vcpu_hashed.
+ */
+enum vcpu_state {
+ vcpu_running = 0,
+ vcpu_halted, /* Used only in pv_wait_node */
+ vcpu_hashed, /* = pv_hash'ed + vcpu_halted */
+};
+
+/*
+ * Hybrid PV queued/unfair lock
+ *
+ * This function is called once when a lock waiter enters the PV slowpath
+ * before being queued.
+ *
+ * The pending bit is set by the queue head vCPU of the MCS wait queue in
+ * pv_wait_head_or_lock() to signal that it is ready to spin on the lock.
+ * When that bit becomes visible to the incoming waiters, no lock stealing
+ * is allowed. The function will return immediately to make the waiters
+ * enter the MCS wait queue. So lock starvation shouldn't happen as long
+ * as the queued mode vCPUs are actively running to set the pending bit
+ * and hence disabling lock stealing.
+ *
+ * When the pending bit isn't set, the lock waiters will stay in the unfair
+ * mode spinning on the lock unless the MCS wait queue is empty. In this
+ * case, the lock waiters will enter the queued mode slowpath trying to
+ * become the queue head and set the pending bit.
+ *
+ * This hybrid PV queued/unfair lock combines the best attributes of a
+ * queued lock (no lock starvation) and an unfair lock (good performance
+ * on not heavily contended locks).
+ */
+static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
+{
+ /*
+ * Stay in unfair lock mode as long as queued mode waiters are
+ * present in the MCS wait queue but the pending bit isn't set.
+ */
+ for (;;) {
+ int val = atomic_read(&lock->val);
+
+ if (!(val & _Q_LOCKED_PENDING_MASK) &&
+ (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
+ lockevent_inc(pv_lock_stealing);
+ return true;
+ }
+ if (!(val & _Q_TAIL_MASK) || (val & _Q_PENDING_MASK))
+ break;
+
+ cpu_relax();
+ }
+
+ return false;
+}
+
+/*
+ * Lock and MCS node addresses hash table for fast lookup
+ *
+ * Hashing is done on a per-cacheline basis to minimize the need to access
+ * more than one cacheline.
+ *
+ * Dynamically allocate a hash table big enough to hold at least 4X the
+ * number of possible cpus in the system. Allocation is done on page
+ * granularity. So the minimum number of hash buckets should be at least
+ * 256 (64-bit) or 512 (32-bit) to fully utilize a 4k page.
+ *
+ * Since we should not be holding locks from NMI context (very rare indeed) the
+ * max load factor is 0.75, which is around the point where open addressing
+ * breaks down.
+ *
+ */
+struct pv_hash_entry {
+ struct qspinlock *lock;
+ struct qnode *node;
+};
+
+#define PV_HE_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_entry))
+#define PV_HE_MIN (PAGE_SIZE / sizeof(struct pv_hash_entry))
+
+static struct pv_hash_entry *pv_lock_hash;
+static unsigned int pv_lock_hash_bits __read_mostly;
+
+/*
+ * Allocate memory for the PV qspinlock hash buckets
+ *
+ * This function should be called from the paravirt spinlock initialization
+ * routine.
+ */
+void __init __pv_init_lock_hash(void)
+{
+ int pv_hash_size = ALIGN(4 * num_possible_cpus(), PV_HE_PER_LINE);
+
+ if (pv_hash_size < PV_HE_MIN)
+ pv_hash_size = PV_HE_MIN;
+
+ /*
+ * Allocate space from bootmem which should be page-size aligned
+ * and hence cacheline aligned.
+ */
+ pv_lock_hash = alloc_large_system_hash("PV qspinlock",
+ sizeof(struct pv_hash_entry),
+ pv_hash_size, 0,
+ HASH_EARLY | HASH_ZERO,
+ &pv_lock_hash_bits, NULL,
+ pv_hash_size, pv_hash_size);
+}
+
+#define for_each_hash_entry(he, offset, hash) \
+ for (hash &= ~(PV_HE_PER_LINE - 1), he = &pv_lock_hash[hash], offset = 0; \
+ offset < (1 << pv_lock_hash_bits); \
+ offset++, he = &pv_lock_hash[(hash + offset) & ((1 << pv_lock_hash_bits) - 1)])
+
+static struct qspinlock **pv_hash(struct qspinlock *lock, struct qnode *node)
+{
+ unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
+ struct pv_hash_entry *he;
+ int hopcnt = 0;
+
+ for_each_hash_entry(he, offset, hash) {
+ hopcnt++;
+ if (!cmpxchg(&he->lock, NULL, lock)) {
+ WRITE_ONCE(he->node, node);
+ lockevent_pv_hop(hopcnt);
+ return &he->lock;
+ }
+ }
+ /*
+ * Hard assume there is a free entry for us.
+ *
+ * This is guaranteed by ensuring every blocked lock only ever consumes
+ * a single entry, and since we only have 4 nesting levels per CPU
+ * and allocated 4*nr_possible_cpus(), this must be so.
+ *
+ * The single entry is guaranteed by having the lock owner unhash
+ * before it releases.
+ */
+ BUG();
+}
+
+static struct qnode *pv_unhash(struct qspinlock *lock)
+{
+ unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
+ struct pv_hash_entry *he;
+ struct qnode *node;
+
+ for_each_hash_entry(he, offset, hash) {
+ if (READ_ONCE(he->lock) == lock) {
+ node = READ_ONCE(he->node);
+ WRITE_ONCE(he->lock, NULL);
+ return node;
+ }
+ }
+ /*
+ * Hard assume we'll find an entry.
+ *
+ * This guarantees a limited lookup time and is itself guaranteed by
+ * having the lock owner do the unhash -- IFF the unlock sees the
+ * SLOW flag, there MUST be a hash entry.
+ */
+ BUG();
+}
+
+/*
+ * Return true if when it is time to check the previous node which is not
+ * in a running state.
+ */
+static inline bool
+pv_wait_early(struct qnode *prev, int loop)
+{
+ if ((loop & PV_PREV_CHECK_MASK) != 0)
+ return false;
+
+ return READ_ONCE(prev->state) != vcpu_running;
+}
+
+/*
+ * Initialize the PV part of the qnode.
+ */
+static void pv_init_node(struct qnode *node)
+{
+ node->cpu = smp_processor_id();
+ node->state = vcpu_running;
+}
+
+/*
+ * Wait for node->locked to become true, halt the vcpu after a short spin.
+ * pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
+ * behalf.
+ */
+static void pv_wait_node(struct qnode *node, struct qnode *prev)
+{
+ int loop;
+ bool wait_early;
+
+ for (;;) {
+ for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
+ if (READ_ONCE(node->locked))
+ return;
+ if (pv_wait_early(prev, loop)) {
+ wait_early = true;
+ break;
+ }
+ cpu_relax();
+ }
+
+ /*
+ * Order node->state vs node->locked thusly:
+ *
+ * [S] node->state = vcpu_halted [S] next->locked = 1
+ * MB MB
+ * [L] node->locked [RmW] node->state = vcpu_hashed
+ *
+ * Matches the cmpxchg() from pv_kick_node().
+ */
+ smp_store_mb(node->state, vcpu_halted);
+
+ if (!READ_ONCE(node->locked)) {
+ lockevent_inc(pv_wait_node);
+ lockevent_cond_inc(pv_wait_early, wait_early);
+ pv_wait(&node->state, vcpu_halted);
+ }
+
+ /*
+ * If pv_kick_node() changed us to vcpu_hashed, retain that
+ * value so that pv_wait_head_or_lock() knows to not also try
+ * to hash this lock.
+ */
+ cmpxchg(&node->state, vcpu_halted, vcpu_running);
+
+ /*
+ * If the locked flag is still not set after wakeup, it is a
+ * spurious wakeup and the vCPU should wait again. However,
+ * there is a pretty high overhead for CPU halting and kicking.
+ * So it is better to spin for a while in the hope that the
+ * MCS lock will be released soon.
+ */
+ lockevent_cond_inc(pv_spurious_wakeup,
+ !READ_ONCE(node->locked));
+ }
+
+ /*
+ * By now our node->locked should be 1 and our caller will not actually
+ * spin-wait for it. We do however rely on our caller to do a
+ * load-acquire for us.
+ */
+}
+
+/*
+ * Called after setting next->locked = 1 when we're the lock owner.
+ *
+ * Instead of waking the waiters stuck in pv_wait_node() advance their state
+ * such that they're waiting in pv_wait_head_or_lock(), this avoids a
+ * wake/sleep cycle.
+ */
+static void pv_kick_node(struct qspinlock *lock, struct qnode *node)
+{
+ /*
+ * If the vCPU is indeed halted, advance its state to match that of
+ * pv_wait_node(). If OTOH this fails, the vCPU was running and will
+ * observe its next->locked value and advance itself.
+ *
+ * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
+ *
+ * The write to next->locked in arch_mcs_spin_unlock_contended()
+ * must be ordered before the read of node->state in the cmpxchg()
+ * below for the code to work correctly. To guarantee full ordering
+ * irrespective of the success or failure of the cmpxchg(),
+ * a relaxed version with explicit barrier is used. The control
+ * dependency will order the reading of node->state before any
+ * subsequent writes.
+ */
+ smp_mb__before_atomic();
+ if (cmpxchg_relaxed(&node->state, vcpu_halted, vcpu_hashed)
+ != vcpu_halted)
+ return;
+
+ /*
+ * Put the lock into the hash table and set the _Q_SLOW_VAL.
+ *
+ * As this is the same vCPU that will check the _Q_SLOW_VAL value and
+ * the hash table later on at unlock time, no atomic instruction is
+ * needed.
+ */
+ WRITE_ONCE(lock->locked, _Q_SLOW_VAL);
+ (void)pv_hash(lock, node);
+}
+
+/*
+ * Wait for l->locked to become clear and acquire the lock;
+ * halt the vcpu after a short spin.
+ * __pv_queued_spin_unlock() will wake us.
+ *
+ * The current value of the lock will be returned for additional processing.
+ */
+static u32
+pv_wait_head_or_lock(struct qspinlock *lock, struct qnode *node)
+{
+ struct qspinlock **lp = NULL;
+ int waitcnt = 0;
+ int loop;
+
+ /*
+ * If pv_kick_node() already advanced our state, we don't need to
+ * insert ourselves into the hash table anymore.
+ */
+ if (READ_ONCE(node->state) == vcpu_hashed)
+ lp = (struct qspinlock **)1;
+
+ /*
+ * Tracking # of slowpath locking operations
+ */
+ lockevent_inc(lock_slowpath);
+
+ for (;; waitcnt++) {
+ /*
+ * Set correct vCPU state to be used by queue node wait-early
+ * mechanism.
+ */
+ WRITE_ONCE(node->state, vcpu_running);
+
+ /*
+ * Set the pending bit in the active lock spinning loop to
+ * disable lock stealing before attempting to acquire the lock.
+ */
+ set_pending(lock);
+ for (loop = SPIN_THRESHOLD; loop; loop--) {
+ if (trylock_clear_pending(lock))
+ goto gotlock;
+ cpu_relax();
+ }
+ clear_pending(lock);
+
+
+ if (!lp) { /* ONCE */
+ lp = pv_hash(lock, node);
+
+ /*
+ * We must hash before setting _Q_SLOW_VAL, such that
+ * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
+ * we'll be sure to be able to observe our hash entry.
+ *
+ * [S] <hash> [Rmw] l->locked == _Q_SLOW_VAL
+ * MB RMB
+ * [RmW] l->locked = _Q_SLOW_VAL [L] <unhash>
+ *
+ * Matches the smp_rmb() in __pv_queued_spin_unlock().
+ */
+ if (xchg(&lock->locked, _Q_SLOW_VAL) == 0) {
+ /*
+ * The lock was free and now we own the lock.
+ * Change the lock value back to _Q_LOCKED_VAL
+ * and unhash the table.
+ */
+ WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
+ WRITE_ONCE(*lp, NULL);
+ goto gotlock;
+ }
+ }
+ WRITE_ONCE(node->state, vcpu_hashed);
+ lockevent_inc(pv_wait_head);
+ lockevent_cond_inc(pv_wait_again, waitcnt);
+ pv_wait(&lock->locked, _Q_SLOW_VAL);
+
+ /*
+ * Because of lock stealing, the queue head vCPU may not be
+ * able to acquire the lock before it has to wait again.
+ */
+ }
+
+ /*
+ * The cmpxchg() or xchg() call before coming here provides the
+ * acquire semantics for locking. The dummy ORing of _Q_LOCKED_VAL
+ * here is to indicate to the compiler that the value will always
+ * be nozero to enable better code optimization.
+ */
+gotlock:
+ return (u32)(atomic_read(&lock->val) | _Q_LOCKED_VAL);
+}
+
+/*
+ * PV versions of the unlock fastpath and slowpath functions to be used
+ * instead of queued_spin_unlock().
+ */
+__visible void
+__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
+{
+ struct qnode *node;
+
+ if (unlikely(locked != _Q_SLOW_VAL)) {
+ WARN(!debug_locks_silent,
+ "pvqspinlock: lock 0x%lx has corrupted value 0x%x!\n",
+ (unsigned long)lock, atomic_read(&lock->val));
+ return;
+ }
+
+ /*
+ * A failed cmpxchg doesn't provide any memory-ordering guarantees,
+ * so we need a barrier to order the read of the node data in
+ * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
+ *
+ * Matches the cmpxchg() in pv_wait_head_or_lock() setting _Q_SLOW_VAL.
+ */
+ smp_rmb();
+
+ /*
+ * Since the above failed to release, this must be the SLOW path.
+ * Therefore start by looking up the blocked node and unhashing it.
+ */
+ node = pv_unhash(lock);
+
+ /*
+ * Now that we have a reference to the (likely) blocked qnode,
+ * release the lock.
+ */
+ smp_store_release(&lock->locked, 0);
+
+ /*
+ * At this point the memory pointed at by lock can be freed/reused,
+ * however we can still use the qnode to kick the CPU.
+ * The other vCPU may not really be halted, but kicking an active
+ * vCPU is harmless other than the additional latency in completing
+ * the unlock.
+ */
+ lockevent_inc(pv_kick_unlock);
+ pv_kick(node->cpu);
+}
+
+/*
+ * Include the architecture specific callee-save thunk of the
+ * __pv_queued_spin_unlock(). This thunk is put together with
+ * __pv_queued_spin_unlock() to make the callee-save thunk and the real unlock
+ * function close to each other sharing consecutive instruction cachelines.
+ * Alternatively, architecture specific version of __pv_queued_spin_unlock()
+ * can be defined.
+ */
+#include <asm/qspinlock_paravirt.h>
+
+#ifndef __pv_queued_spin_unlock
+__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
+{
+ u8 locked;
+
+ /*
+ * We must not unlock if SLOW, because in that case we must first
+ * unhash. Otherwise it would be possible to have multiple @lock
+ * entries, which would be BAD.
+ */
+ locked = cmpxchg_release(&lock->locked, _Q_LOCKED_VAL, 0);
+ if (likely(locked == _Q_LOCKED_VAL))
+ return;
+
+ __pv_queued_spin_unlock_slowpath(lock, locked);
+}
+#endif
+
#else /* CONFIG_PARAVIRT_SPINLOCKS */
static __always_inline void pv_init_node(struct qnode *node) { }
static __always_inline void pv_wait_node(struct qnode *node,
diff --git a/kernel/locking/qspinlock_paravirt.h b/kernel/locking/qspinlock_paravirt.h
deleted file mode 100644
index f1922e3a0f7d..000000000000
--- a/kernel/locking/qspinlock_paravirt.h
+++ /dev/null
@@ -1,490 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#include <linux/hash.h>
-#include <linux/memblock.h>
-#include <linux/debug_locks.h>
-
-/*
- * Implement paravirt qspinlocks; the general idea is to halt the vcpus instead
- * of spinning them.
- *
- * This relies on the architecture to provide two paravirt hypercalls:
- *
- * pv_wait(u8 *ptr, u8 val) -- suspends the vcpu if *ptr == val
- * pv_kick(cpu) -- wakes a suspended vcpu
- *
- * Using these we implement __pv_queued_spin_lock_slowpath() and
- * __pv_queued_spin_unlock() to replace native_queued_spin_lock_slowpath() and
- * native_queued_spin_unlock().
- */
-
-#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)
-
-/*
- * Queue Node Adaptive Spinning
- *
- * A queue node vCPU will stop spinning if the vCPU in the previous node is
- * not running. The one lock stealing attempt allowed at slowpath entry
- * mitigates the slight slowdown for non-overcommitted guest with this
- * aggressive wait-early mechanism.
- *
- * The status of the previous node will be checked at fixed interval
- * controlled by PV_PREV_CHECK_MASK. This is to ensure that we won't
- * pound on the cacheline of the previous node too heavily.
- */
-#define PV_PREV_CHECK_MASK 0xff
-
-/*
- * Queue node uses: vcpu_running & vcpu_halted.
- * Queue head uses: vcpu_running & vcpu_hashed.
- */
-enum vcpu_state {
- vcpu_running = 0,
- vcpu_halted, /* Used only in pv_wait_node */
- vcpu_hashed, /* = pv_hash'ed + vcpu_halted */
-};
-
-/*
- * Hybrid PV queued/unfair lock
- *
- * This function is called once when a lock waiter enters the PV slowpath
- * before being queued.
- *
- * The pending bit is set by the queue head vCPU of the MCS wait queue in
- * pv_wait_head_or_lock() to signal that it is ready to spin on the lock.
- * When that bit becomes visible to the incoming waiters, no lock stealing
- * is allowed. The function will return immediately to make the waiters
- * enter the MCS wait queue. So lock starvation shouldn't happen as long
- * as the queued mode vCPUs are actively running to set the pending bit
- * and hence disabling lock stealing.
- *
- * When the pending bit isn't set, the lock waiters will stay in the unfair
- * mode spinning on the lock unless the MCS wait queue is empty. In this
- * case, the lock waiters will enter the queued mode slowpath trying to
- * become the queue head and set the pending bit.
- *
- * This hybrid PV queued/unfair lock combines the best attributes of a
- * queued lock (no lock starvation) and an unfair lock (good performance
- * on not heavily contended locks).
- */
-static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
-{
- /*
- * Stay in unfair lock mode as long as queued mode waiters are
- * present in the MCS wait queue but the pending bit isn't set.
- */
- for (;;) {
- int val = atomic_read(&lock->val);
-
- if (!(val & _Q_LOCKED_PENDING_MASK) &&
- (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
- lockevent_inc(pv_lock_stealing);
- return true;
- }
- if (!(val & _Q_TAIL_MASK) || (val & _Q_PENDING_MASK))
- break;
-
- cpu_relax();
- }
-
- return false;
-}
-
-/*
- * Lock and MCS node addresses hash table for fast lookup
- *
- * Hashing is done on a per-cacheline basis to minimize the need to access
- * more than one cacheline.
- *
- * Dynamically allocate a hash table big enough to hold at least 4X the
- * number of possible cpus in the system. Allocation is done on page
- * granularity. So the minimum number of hash buckets should be at least
- * 256 (64-bit) or 512 (32-bit) to fully utilize a 4k page.
- *
- * Since we should not be holding locks from NMI context (very rare indeed) the
- * max load factor is 0.75, which is around the point where open addressing
- * breaks down.
- *
- */
-struct pv_hash_entry {
- struct qspinlock *lock;
- struct qnode *node;
-};
-
-#define PV_HE_PER_LINE (SMP_CACHE_BYTES / sizeof(struct pv_hash_entry))
-#define PV_HE_MIN (PAGE_SIZE / sizeof(struct pv_hash_entry))
-
-static struct pv_hash_entry *pv_lock_hash;
-static unsigned int pv_lock_hash_bits __read_mostly;
-
-/*
- * Allocate memory for the PV qspinlock hash buckets
- *
- * This function should be called from the paravirt spinlock initialization
- * routine.
- */
-void __init __pv_init_lock_hash(void)
-{
- int pv_hash_size = ALIGN(4 * num_possible_cpus(), PV_HE_PER_LINE);
-
- if (pv_hash_size < PV_HE_MIN)
- pv_hash_size = PV_HE_MIN;
-
- /*
- * Allocate space from bootmem which should be page-size aligned
- * and hence cacheline aligned.
- */
- pv_lock_hash = alloc_large_system_hash("PV qspinlock",
- sizeof(struct pv_hash_entry),
- pv_hash_size, 0,
- HASH_EARLY | HASH_ZERO,
- &pv_lock_hash_bits, NULL,
- pv_hash_size, pv_hash_size);
-}
-
-#define for_each_hash_entry(he, offset, hash) \
- for (hash &= ~(PV_HE_PER_LINE - 1), he = &pv_lock_hash[hash], offset = 0; \
- offset < (1 << pv_lock_hash_bits); \
- offset++, he = &pv_lock_hash[(hash + offset) & ((1 << pv_lock_hash_bits) - 1)])
-
-static struct qspinlock **pv_hash(struct qspinlock *lock, struct qnode *node)
-{
- unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
- struct pv_hash_entry *he;
- int hopcnt = 0;
-
- for_each_hash_entry(he, offset, hash) {
- hopcnt++;
- if (!cmpxchg(&he->lock, NULL, lock)) {
- WRITE_ONCE(he->node, node);
- lockevent_pv_hop(hopcnt);
- return &he->lock;
- }
- }
- /*
- * Hard assume there is a free entry for us.
- *
- * This is guaranteed by ensuring every blocked lock only ever consumes
- * a single entry, and since we only have 4 nesting levels per CPU
- * and allocated 4*nr_possible_cpus(), this must be so.
- *
- * The single entry is guaranteed by having the lock owner unhash
- * before it releases.
- */
- BUG();
-}
-
-static struct qnode *pv_unhash(struct qspinlock *lock)
-{
- unsigned long offset, hash = hash_ptr(lock, pv_lock_hash_bits);
- struct pv_hash_entry *he;
- struct qnode *node;
-
- for_each_hash_entry(he, offset, hash) {
- if (READ_ONCE(he->lock) == lock) {
- node = READ_ONCE(he->node);
- WRITE_ONCE(he->lock, NULL);
- return node;
- }
- }
- /*
- * Hard assume we'll find an entry.
- *
- * This guarantees a limited lookup time and is itself guaranteed by
- * having the lock owner do the unhash -- IFF the unlock sees the
- * SLOW flag, there MUST be a hash entry.
- */
- BUG();
-}
-
-/*
- * Return true if when it is time to check the previous node which is not
- * in a running state.
- */
-static inline bool
-pv_wait_early(struct qnode *prev, int loop)
-{
- if ((loop & PV_PREV_CHECK_MASK) != 0)
- return false;
-
- return READ_ONCE(prev->state) != vcpu_running;
-}
-
-/*
- * Initialize the PV part of the qnode.
- */
-static void pv_init_node(struct qnode *node)
-{
- node->cpu = smp_processor_id();
- node->state = vcpu_running;
-}
-
-/*
- * Wait for node->locked to become true, halt the vcpu after a short spin.
- * pv_kick_node() is used to set _Q_SLOW_VAL and fill in hash table on its
- * behalf.
- */
-static void pv_wait_node(struct qnode *node, struct qnode *prev)
-{
- int loop;
- bool wait_early;
-
- for (;;) {
- for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
- if (READ_ONCE(node->locked))
- return;
- if (pv_wait_early(prev, loop)) {
- wait_early = true;
- break;
- }
- cpu_relax();
- }
-
- /*
- * Order node->state vs node->locked thusly:
- *
- * [S] node->state = vcpu_halted [S] next->locked = 1
- * MB MB
- * [L] node->locked [RmW] node->state = vcpu_hashed
- *
- * Matches the cmpxchg() from pv_kick_node().
- */
- smp_store_mb(node->state, vcpu_halted);
-
- if (!READ_ONCE(node->locked)) {
- lockevent_inc(pv_wait_node);
- lockevent_cond_inc(pv_wait_early, wait_early);
- pv_wait(&node->state, vcpu_halted);
- }
-
- /*
- * If pv_kick_node() changed us to vcpu_hashed, retain that
- * value so that pv_wait_head_or_lock() knows to not also try
- * to hash this lock.
- */
- cmpxchg(&node->state, vcpu_halted, vcpu_running);
-
- /*
- * If the locked flag is still not set after wakeup, it is a
- * spurious wakeup and the vCPU should wait again. However,
- * there is a pretty high overhead for CPU halting and kicking.
- * So it is better to spin for a while in the hope that the
- * MCS lock will be released soon.
- */
- lockevent_cond_inc(pv_spurious_wakeup,
- !READ_ONCE(node->locked));
- }
-
- /*
- * By now our node->locked should be 1 and our caller will not actually
- * spin-wait for it. We do however rely on our caller to do a
- * load-acquire for us.
- */
-}
-
-/*
- * Called after setting next->locked = 1 when we're the lock owner.
- *
- * Instead of waking the waiters stuck in pv_wait_node() advance their state
- * such that they're waiting in pv_wait_head_or_lock(), this avoids a
- * wake/sleep cycle.
- */
-static void pv_kick_node(struct qspinlock *lock, struct qnode *node)
-{
- /*
- * If the vCPU is indeed halted, advance its state to match that of
- * pv_wait_node(). If OTOH this fails, the vCPU was running and will
- * observe its next->locked value and advance itself.
- *
- * Matches with smp_store_mb() and cmpxchg() in pv_wait_node()
- *
- * The write to next->locked in arch_mcs_spin_unlock_contended()
- * must be ordered before the read of node->state in the cmpxchg()
- * below for the code to work correctly. To guarantee full ordering
- * irrespective of the success or failure of the cmpxchg(),
- * a relaxed version with explicit barrier is used. The control
- * dependency will order the reading of node->state before any
- * subsequent writes.
- */
- smp_mb__before_atomic();
- if (cmpxchg_relaxed(&node->state, vcpu_halted, vcpu_hashed)
- != vcpu_halted)
- return;
-
- /*
- * Put the lock into the hash table and set the _Q_SLOW_VAL.
- *
- * As this is the same vCPU that will check the _Q_SLOW_VAL value and
- * the hash table later on at unlock time, no atomic instruction is
- * needed.
- */
- WRITE_ONCE(lock->locked, _Q_SLOW_VAL);
- (void)pv_hash(lock, node);
-}
-
-/*
- * Wait for l->locked to become clear and acquire the lock;
- * halt the vcpu after a short spin.
- * __pv_queued_spin_unlock() will wake us.
- *
- * The current value of the lock will be returned for additional processing.
- */
-static u32
-pv_wait_head_or_lock(struct qspinlock *lock, struct qnode *node)
-{
- struct qspinlock **lp = NULL;
- int waitcnt = 0;
- int loop;
-
- /*
- * If pv_kick_node() already advanced our state, we don't need to
- * insert ourselves into the hash table anymore.
- */
- if (READ_ONCE(node->state) == vcpu_hashed)
- lp = (struct qspinlock **)1;
-
- /*
- * Tracking # of slowpath locking operations
- */
- lockevent_inc(lock_slowpath);
-
- for (;; waitcnt++) {
- /*
- * Set correct vCPU state to be used by queue node wait-early
- * mechanism.
- */
- WRITE_ONCE(node->state, vcpu_running);
-
- /*
- * Set the pending bit in the active lock spinning loop to
- * disable lock stealing before attempting to acquire the lock.
- */
- set_pending(lock);
- for (loop = SPIN_THRESHOLD; loop; loop--) {
- if (trylock_clear_pending(lock))
- goto gotlock;
- cpu_relax();
- }
- clear_pending(lock);
-
-
- if (!lp) { /* ONCE */
- lp = pv_hash(lock, node);
-
- /*
- * We must hash before setting _Q_SLOW_VAL, such that
- * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock()
- * we'll be sure to be able to observe our hash entry.
- *
- * [S] <hash> [Rmw] l->locked == _Q_SLOW_VAL
- * MB RMB
- * [RmW] l->locked = _Q_SLOW_VAL [L] <unhash>
- *
- * Matches the smp_rmb() in __pv_queued_spin_unlock().
- */
- if (xchg(&lock->locked, _Q_SLOW_VAL) == 0) {
- /*
- * The lock was free and now we own the lock.
- * Change the lock value back to _Q_LOCKED_VAL
- * and unhash the table.
- */
- WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
- WRITE_ONCE(*lp, NULL);
- goto gotlock;
- }
- }
- WRITE_ONCE(node->state, vcpu_hashed);
- lockevent_inc(pv_wait_head);
- lockevent_cond_inc(pv_wait_again, waitcnt);
- pv_wait(&lock->locked, _Q_SLOW_VAL);
-
- /*
- * Because of lock stealing, the queue head vCPU may not be
- * able to acquire the lock before it has to wait again.
- */
- }
-
- /*
- * The cmpxchg() or xchg() call before coming here provides the
- * acquire semantics for locking. The dummy ORing of _Q_LOCKED_VAL
- * here is to indicate to the compiler that the value will always
- * be nozero to enable better code optimization.
- */
-gotlock:
- return (u32)(atomic_read(&lock->val) | _Q_LOCKED_VAL);
-}
-
-/*
- * PV versions of the unlock fastpath and slowpath functions to be used
- * instead of queued_spin_unlock().
- */
-__visible void
-__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
-{
- struct qnode *node;
-
- if (unlikely(locked != _Q_SLOW_VAL)) {
- WARN(!debug_locks_silent,
- "pvqspinlock: lock 0x%lx has corrupted value 0x%x!\n",
- (unsigned long)lock, atomic_read(&lock->val));
- return;
- }
-
- /*
- * A failed cmpxchg doesn't provide any memory-ordering guarantees,
- * so we need a barrier to order the read of the node data in
- * pv_unhash *after* we've read the lock being _Q_SLOW_VAL.
- *
- * Matches the cmpxchg() in pv_wait_head_or_lock() setting _Q_SLOW_VAL.
- */
- smp_rmb();
-
- /*
- * Since the above failed to release, this must be the SLOW path.
- * Therefore start by looking up the blocked node and unhashing it.
- */
- node = pv_unhash(lock);
-
- /*
- * Now that we have a reference to the (likely) blocked qnode,
- * release the lock.
- */
- smp_store_release(&lock->locked, 0);
-
- /*
- * At this point the memory pointed at by lock can be freed/reused,
- * however we can still use the qnode to kick the CPU.
- * The other vCPU may not really be halted, but kicking an active
- * vCPU is harmless other than the additional latency in completing
- * the unlock.
- */
- lockevent_inc(pv_kick_unlock);
- pv_kick(node->cpu);
-}
-
-/*
- * Include the architecture specific callee-save thunk of the
- * __pv_queued_spin_unlock(). This thunk is put together with
- * __pv_queued_spin_unlock() to make the callee-save thunk and the real unlock
- * function close to each other sharing consecutive instruction cachelines.
- * Alternatively, architecture specific version of __pv_queued_spin_unlock()
- * can be defined.
- */
-#include <asm/qspinlock_paravirt.h>
-
-#ifndef __pv_queued_spin_unlock
-__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
-{
- u8 locked;
-
- /*
- * We must not unlock if SLOW, because in that case we must first
- * unhash. Otherwise it would be possible to have multiple @lock
- * entries, which would be BAD.
- */
- locked = cmpxchg_release(&lock->locked, _Q_LOCKED_VAL, 0);
- if (likely(locked == _Q_LOCKED_VAL))
- return;
-
- __pv_queued_spin_unlock_slowpath(lock, locked);
-}
-#endif /* __pv_queued_spin_unlock */
--
2.35.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 06/13] locking/qspinlock: merge qspinlock_paravirt.h into qspinlock.c
2022-07-04 14:38 ` [PATCH 06/13] locking/qspinlock: merge qspinlock_paravirt.h into qspinlock.c Nicholas Piggin
@ 2022-07-05 17:20 ` Peter Zijlstra
2022-07-05 17:36 ` Peter Zijlstra
2022-07-06 13:35 ` Waiman Long
0 siblings, 2 replies; 38+ messages in thread
From: Peter Zijlstra @ 2022-07-05 17:20 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, linux-kernel
On Tue, Jul 05, 2022 at 12:38:13AM +1000, Nicholas Piggin wrote:
> There isn't much reason to keep these separate.
The reason was so that other paravirt implementations could be added.
The CNA thing was also implemented this way...
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 06/13] locking/qspinlock: merge qspinlock_paravirt.h into qspinlock.c
2022-07-05 17:20 ` Peter Zijlstra
@ 2022-07-05 17:36 ` Peter Zijlstra
2022-07-12 0:46 ` Nicholas Piggin
2022-07-06 13:35 ` Waiman Long
1 sibling, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2022-07-05 17:36 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, linux-kernel
On Tue, Jul 05, 2022 at 07:20:37PM +0200, Peter Zijlstra wrote:
> On Tue, Jul 05, 2022 at 12:38:13AM +1000, Nicholas Piggin wrote:
> > There isn't much reason to keep these separate.
>
> The reason was so that other paravirt implementations could be added.
>
> The CNA thing was also implemented this way...
Also, (TIL) s390 seems to have a copy of all this:
b96f7d881ad9 ("s390/spinlock: introduce spinlock wait queueing")
it might be nice if it were possible to fold that back into this with a
different paravirt implementation; but I've not looked at it in any
great detail yet.
^ permalink raw reply [flat|nested] 38+ messages in thread* Re: [PATCH 06/13] locking/qspinlock: merge qspinlock_paravirt.h into qspinlock.c
2022-07-05 17:36 ` Peter Zijlstra
@ 2022-07-12 0:46 ` Nicholas Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-12 0:46 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Boqun Feng, linux-kernel, Waiman Long, Ingo Molnar, Will Deacon
Excerpts from Peter Zijlstra's message of July 6, 2022 3:36 am:
> On Tue, Jul 05, 2022 at 07:20:37PM +0200, Peter Zijlstra wrote:
>> On Tue, Jul 05, 2022 at 12:38:13AM +1000, Nicholas Piggin wrote:
>> > There isn't much reason to keep these separate.
>>
>> The reason was so that other paravirt implementations could be added.
>>
>> The CNA thing was also implemented this way...
>
> Also, (TIL) s390 seems to have a copy of all this:
>
> b96f7d881ad9 ("s390/spinlock: introduce spinlock wait queueing")
Right. powerpc is going to add another one. I've also been struggling
to make PV qspinlock work well for us and the PV hooks didn't quite
help. This series is just me dumping what I had done while working
with the code until now, feel free to take or leave it.
The PV things could easily be moved out again or conditionally
compiled when another implementation in tree comes up.
> it might be nice if it were possible to fold that back into this with a
> different paravirt implementation; but I've not looked at it in any
> great detail yet.
I would like to fold powerpc in some time (or at least merge it with
s390, which it is possibly more similar to), but at the moment I kind
of need to prove it even works on our systems and solves issues before
bothering other archs with it. It's not all that complicated though,
so I'm not too concerned about carrying it in arch code.
https://lists.ozlabs.org/pipermail/linuxppc-dev/2022-July/245977.html
Thanks,
Nick
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 06/13] locking/qspinlock: merge qspinlock_paravirt.h into qspinlock.c
2022-07-05 17:20 ` Peter Zijlstra
2022-07-05 17:36 ` Peter Zijlstra
@ 2022-07-06 13:35 ` Waiman Long
2022-07-06 14:16 ` Peter Zijlstra
1 sibling, 1 reply; 38+ messages in thread
From: Waiman Long @ 2022-07-06 13:35 UTC (permalink / raw)
To: Peter Zijlstra, Nicholas Piggin
Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel
On 7/5/22 13:20, Peter Zijlstra wrote:
> On Tue, Jul 05, 2022 at 12:38:13AM +1000, Nicholas Piggin wrote:
>> There isn't much reason to keep these separate.
> The reason was so that other paravirt implementations could be added.
>
> The CNA thing was also implemented this way...
Do you have any plan to take CNA [1] some time in the future?
Anyway, the main reason the paravirt code is separated into a separate
file is to leave only the core part in qspinlock.c so that new users are
not overwhelmed with the messy details of the paravirt code. Putting
everything into a single file can make it harder to read for newbies.
Also eliminating the preprocessor trick will make it harder to integrate
a different qspinlock variant, like CNA, into the code base.
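(For reference, the preprocessor trick being discussed is the self-include at
the bottom of kernel/locking/qspinlock.c; condensed from memory rather than
quoted verbatim, it is roughly the following, with the pv_*() stubs redefined
and the slowpath compiled a second time under the __pv_ name:)
#if !defined(_GEN_PV_LOCK_SLOWPATH) && defined(CONFIG_PARAVIRT_SPINLOCKS)
#define _GEN_PV_LOCK_SLOWPATH
#undef  pv_enabled
#define pv_enabled()	true
#undef  pv_init_node
#undef  pv_wait_node
#undef  pv_kick_node
#undef  pv_wait_head_or_lock
#undef  queued_spin_lock_slowpath
#define queued_spin_lock_slowpath	__pv_queued_spin_lock_slowpath
#include "qspinlock_paravirt.h"
#include "qspinlock.c"	/* re-include: emits __pv_queued_spin_lock_slowpath() */
#endif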
Cheers,
Longman
[1]
https://lore.kernel.org/lkml/20210514200743.3026725-1-alex.kogan@oracle.com/
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 06/13] locking/qspinlock: merge qspinlock_paravirt.h into qspinlock.c
2022-07-06 13:35 ` Waiman Long
@ 2022-07-06 14:16 ` Peter Zijlstra
0 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2022-07-06 14:16 UTC (permalink / raw)
To: Waiman Long
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Boqun Feng,
linux-kernel
On Wed, Jul 06, 2022 at 09:35:48AM -0400, Waiman Long wrote:
> On 7/5/22 13:20, Peter Zijlstra wrote:
> > On Tue, Jul 05, 2022 at 12:38:13AM +1000, Nicholas Piggin wrote:
> > > There isn't much reason to keep these separate.
> > The reason was so that other paravirt implementations could be added.
> >
> > The CNA thing was also implemented this way...
>
> Do you have any plan to take CNA [1] some time in the future?
I do mean to look at it, but somehow I never get around to it :-(
ISTR the O(n) on unlock got fixed.
Also; and I think this is why it never really gets to the top of the
todo list, I do still feel it is fixing the wrong problem. IIRC the
whole reason for CNA is the futex hash lock and I think that a better
futex ABI would be a *much* better solution there.
> Anyway, the main reason the paravirt code is separated into a separate file
> is to leave only the core part in qspinlock.c so that new users are
> not overwhelmed with the messy details of the paravirt code. Putting everything
> into a single file can make it harder to read for newbies.
>
> Also eliminating the preprocessor trick will make it harder to integrate a
> different qspinlock variant, like CNA, into the code base.
Fixed that already: YsR8BIyrSCQ8AlEo@worktop.programming.kicks-ass.net
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 07/13] locking/qspinlock: remove arch qspinlock_paravirt.h includes
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
` (5 preceding siblings ...)
2022-07-04 14:38 ` [PATCH 06/13] locking/qspinlock: merge qspinlock_paravirt.h into qspinlock.c Nicholas Piggin
@ 2022-07-04 14:38 ` Nicholas Piggin
2022-07-04 14:38 ` [PATCH 08/13] locking/qspinlock: stop renaming queued_spin_lock_slowpath to native_queued_spin_lock_slowpath Nicholas Piggin
` (6 subsequent siblings)
13 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Waiman Long,
Boqun Feng, linux-kernel
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/qspinlock_paravirt.h | 7 --
arch/x86/include/asm/qspinlock.h | 4 ++
arch/x86/include/asm/qspinlock_paravirt.h | 72 -------------------
arch/x86/kernel/paravirt-spinlocks.c | 71 ++++++++++++++++++
kernel/locking/qspinlock.c | 11 +--
5 files changed, 76 insertions(+), 89 deletions(-)
delete mode 100644 arch/powerpc/include/asm/qspinlock_paravirt.h
delete mode 100644 arch/x86/include/asm/qspinlock_paravirt.h
diff --git a/arch/powerpc/include/asm/qspinlock_paravirt.h b/arch/powerpc/include/asm/qspinlock_paravirt.h
deleted file mode 100644
index 6b60e7736a47..000000000000
--- a/arch/powerpc/include/asm/qspinlock_paravirt.h
+++ /dev/null
@@ -1,7 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0-or-later */
-#ifndef _ASM_POWERPC_QSPINLOCK_PARAVIRT_H
-#define _ASM_POWERPC_QSPINLOCK_PARAVIRT_H
-
-EXPORT_SYMBOL(__pv_queued_spin_unlock);
-
-#endif /* _ASM_POWERPC_QSPINLOCK_PARAVIRT_H */
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index d87451df480b..7f914fe7bc30 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -34,6 +34,10 @@ extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
extern bool nopvspin;
+#ifdef CONFIG_64BIT
+#define __pv_queued_spin_unlock __pv_queued_spin_unlock
+#endif
+
#define queued_spin_unlock queued_spin_unlock
/**
* queued_spin_unlock - release a queued spinlock
diff --git a/arch/x86/include/asm/qspinlock_paravirt.h b/arch/x86/include/asm/qspinlock_paravirt.h
deleted file mode 100644
index 892fd8c3a6f7..000000000000
--- a/arch/x86/include/asm/qspinlock_paravirt.h
+++ /dev/null
@@ -1,72 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __ASM_QSPINLOCK_PARAVIRT_H
-#define __ASM_QSPINLOCK_PARAVIRT_H
-
-#include <asm/ibt.h>
-
-/*
- * For x86-64, PV_CALLEE_SAVE_REGS_THUNK() saves and restores 8 64-bit
- * registers. For i386, however, only 1 32-bit register needs to be saved
- * and restored. So an optimized version of __pv_queued_spin_unlock() is
- * hand-coded for 64-bit, but it isn't worthwhile to do it for 32-bit.
- */
-#ifdef CONFIG_64BIT
-
-PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
-#define __pv_queued_spin_unlock __pv_queued_spin_unlock
-#define PV_UNLOCK "__raw_callee_save___pv_queued_spin_unlock"
-#define PV_UNLOCK_SLOWPATH "__raw_callee_save___pv_queued_spin_unlock_slowpath"
-
-/*
- * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
- * which combines the registers saving trunk and the body of the following
- * C code:
- *
- * void __pv_queued_spin_unlock(struct qspinlock *lock)
- * {
- * u8 lockval = cmpxchg(&lock->locked, _Q_LOCKED_VAL, 0);
- *
- * if (likely(lockval == _Q_LOCKED_VAL))
- * return;
- * pv_queued_spin_unlock_slowpath(lock, lockval);
- * }
- *
- * For x86-64,
- * rdi = lock (first argument)
- * rsi = lockval (second argument)
- * rdx = internal variable (set to 0)
- */
-asm (".pushsection .text;"
- ".globl " PV_UNLOCK ";"
- ".type " PV_UNLOCK ", @function;"
- ".align 4,0x90;"
- PV_UNLOCK ": "
- ASM_ENDBR
- FRAME_BEGIN
- "push %rdx;"
- "mov $0x1,%eax;"
- "xor %edx,%edx;"
- LOCK_PREFIX "cmpxchg %dl,(%rdi);"
- "cmp $0x1,%al;"
- "jne .slowpath;"
- "pop %rdx;"
- FRAME_END
- ASM_RET
- ".slowpath: "
- "push %rsi;"
- "movzbl %al,%esi;"
- "call " PV_UNLOCK_SLOWPATH ";"
- "pop %rsi;"
- "pop %rdx;"
- FRAME_END
- ASM_RET
- ".size " PV_UNLOCK ", .-" PV_UNLOCK ";"
- ".popsection");
-
-#else /* CONFIG_64BIT */
-
-extern void __pv_queued_spin_unlock(struct qspinlock *lock);
-PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock);
-
-#endif /* CONFIG_64BIT */
-#endif
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 9e1ea99ad9df..c6a107dfe20d 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -7,6 +7,7 @@
#include <linux/export.h>
#include <linux/jump_label.h>
+#include <asm/ibt.h>
#include <asm/paravirt.h>
__visible void __native_queued_spin_unlock(struct qspinlock *lock)
@@ -15,6 +16,76 @@ __visible void __native_queued_spin_unlock(struct qspinlock *lock)
}
PV_CALLEE_SAVE_REGS_THUNK(__native_queued_spin_unlock);
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+/*
+ * For x86-64, PV_CALLEE_SAVE_REGS_THUNK() saves and restores 8 64-bit
+ * registers. For i386, however, only 1 32-bit register needs to be saved
+ * and restored. So an optimized version of __pv_queued_spin_unlock() is
+ * hand-coded for 64-bit, but it isn't worthwhile to do it for 32-bit.
+ */
+#ifdef CONFIG_64BIT
+
+__visible void
+__pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked);
+
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock_slowpath);
+#define PV_UNLOCK "__raw_callee_save___pv_queued_spin_unlock"
+#define PV_UNLOCK_SLOWPATH "__raw_callee_save___pv_queued_spin_unlock_slowpath"
+
+/*
+ * Optimized assembly version of __raw_callee_save___pv_queued_spin_unlock
+ * which combines the registers saving trunk and the body of the following
+ * C code:
+ *
+ * void __pv_queued_spin_unlock(struct qspinlock *lock)
+ * {
+ * u8 lockval = cmpxchg(&lock->locked, _Q_LOCKED_VAL, 0);
+ *
+ * if (likely(lockval == _Q_LOCKED_VAL))
+ * return;
+ * pv_queued_spin_unlock_slowpath(lock, lockval);
+ * }
+ *
+ * For x86-64,
+ * rdi = lock (first argument)
+ * rsi = lockval (second argument)
+ * rdx = internal variable (set to 0)
+ */
+asm (".pushsection .text;"
+ ".globl " PV_UNLOCK ";"
+ ".type " PV_UNLOCK ", @function;"
+ ".align 4,0x90;"
+ PV_UNLOCK ": "
+ ASM_ENDBR
+ FRAME_BEGIN
+ "push %rdx;"
+ "mov $0x1,%eax;"
+ "xor %edx,%edx;"
+ LOCK_PREFIX "cmpxchg %dl,(%rdi);"
+ "cmp $0x1,%al;"
+ "jne .slowpath;"
+ "pop %rdx;"
+ FRAME_END
+ ASM_RET
+ ".slowpath: "
+ "push %rsi;"
+ "movzbl %al,%esi;"
+ "call " PV_UNLOCK_SLOWPATH ";"
+ "pop %rsi;"
+ "pop %rdx;"
+ FRAME_END
+ ASM_RET
+ ".size " PV_UNLOCK ", .-" PV_UNLOCK ";"
+ ".popsection");
+
+#else /* CONFIG_64BIT */
+
+extern void __pv_queued_spin_unlock(struct qspinlock *lock);
+PV_CALLEE_SAVE_REGS_THUNK(__pv_queued_spin_unlock);
+
+#endif /* CONFIG_64BIT */
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
+
bool pv_is_native_spin_unlock(void)
{
return pv_ops.lock.queued_spin_unlock.func ==
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 9a235b0d98ca..4045b5683ecb 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -743,16 +743,6 @@ __pv_queued_spin_unlock_slowpath(struct qspinlock *lock, u8 locked)
pv_kick(node->cpu);
}
-/*
- * Include the architecture specific callee-save thunk of the
- * __pv_queued_spin_unlock(). This thunk is put together with
- * __pv_queued_spin_unlock() to make the callee-save thunk and the real unlock
- * function close to each other sharing consecutive instruction cachelines.
- * Alternatively, architecture specific version of __pv_queued_spin_unlock()
- * can be defined.
- */
-#include <asm/qspinlock_paravirt.h>
-
#ifndef __pv_queued_spin_unlock
__visible void __pv_queued_spin_unlock(struct qspinlock *lock)
{
@@ -769,6 +759,7 @@ __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
__pv_queued_spin_unlock_slowpath(lock, locked);
}
+EXPORT_SYMBOL(__pv_queued_spin_unlock);
#endif
#else /* CONFIG_PARAVIRT_SPINLOCKS */
--
2.35.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 08/13] locking/qspinlock: stop renaming queued_spin_lock_slowpath to native_queued_spin_lock_slowpath
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
` (6 preceding siblings ...)
2022-07-04 14:38 ` [PATCH 07/13] locking/qspinlock: remove arch qspinlock_paravirt.h includes Nicholas Piggin
@ 2022-07-04 14:38 ` Nicholas Piggin
2022-07-05 17:28 ` Peter Zijlstra
2022-07-04 14:38 ` [PATCH 09/13] locking/qspinlock: rename __pv_init_lock_hash to pv_spinlocks_init Nicholas Piggin
` (5 subsequent siblings)
13 siblings, 1 reply; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Waiman Long,
Boqun Feng, linux-kernel
The native version can simply be queued_spin_lock_slowpath, and the
paravirt version __pv_queued_spin_lock_slowpath, which is how they are
named in the C code.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/qspinlock.h | 38 ++++++++++------------------
arch/x86/include/asm/qspinlock.h | 14 +++++++---
arch/x86/kernel/paravirt.c | 2 +-
kernel/locking/qspinlock.c | 8 +-----
4 files changed, 26 insertions(+), 36 deletions(-)
diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index b676c4fb90fd..dd231c756233 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -7,42 +7,32 @@
#define _Q_PENDING_LOOPS (1 << 9) /* not tuned */
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __pv_queued_spin_unlock(struct qspinlock *lock);
+void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+void __pv_queued_spin_unlock(struct qspinlock *lock);
-static __always_inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+static __always_inline void queued_spin_lock(struct qspinlock *lock)
{
- if (!is_shared_processor())
- native_queued_spin_lock_slowpath(lock, val);
+ u32 val = 0;
+
+ if (likely(arch_atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
+ return;
+
+ if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
+ queued_spin_lock_slowpath(lock, val);
else
__pv_queued_spin_lock_slowpath(lock, val);
}
+#define queued_spin_lock queued_spin_lock
-#define queued_spin_unlock queued_spin_unlock
static inline void queued_spin_unlock(struct qspinlock *lock)
{
- if (!is_shared_processor())
+ if (!IS_ENABLED(CONFIG_PARAVIRT_SPINLOCKS) || !is_shared_processor())
smp_store_release(&lock->locked, 0);
else
__pv_queued_spin_unlock(lock);
}
-
-#else
-extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-#endif
-
-static __always_inline void queued_spin_lock(struct qspinlock *lock)
-{
- u32 val = 0;
-
- if (likely(arch_atomic_try_cmpxchg_lock(&lock->val, &val, _Q_LOCKED_VAL)))
- return;
-
- queued_spin_lock_slowpath(lock, val);
-}
-#define queued_spin_lock queued_spin_lock
+#define queued_spin_unlock queued_spin_unlock
#ifdef CONFIG_PARAVIRT_SPINLOCKS
#define SPIN_THRESHOLD (1<<15) /* not tuned */
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 7f914fe7bc30..603ad61e9dfe 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -28,7 +28,7 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
}
#ifdef CONFIG_PARAVIRT_SPINLOCKS
-extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
extern void __pv_init_lock_hash(void);
extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
@@ -38,7 +38,6 @@ extern bool nopvspin;
#define __pv_queued_spin_unlock __pv_queued_spin_unlock
#endif
-#define queued_spin_unlock queued_spin_unlock
/**
* queued_spin_unlock - release a queued spinlock
* @lock : Pointer to queued spinlock structure
@@ -50,22 +49,29 @@ static inline void native_queued_spin_unlock(struct qspinlock *lock)
smp_store_release(&lock->locked, 0);
}
-static inline void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
+static inline void queued_spin_lock(struct qspinlock *lock)
{
+ int val = 0;
+
+ if (likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL)))
+ return;
+
pv_queued_spin_lock_slowpath(lock, val);
}
+#define queued_spin_lock queued_spin_lock
static inline void queued_spin_unlock(struct qspinlock *lock)
{
kcsan_release();
pv_queued_spin_unlock(lock);
}
+#define queued_spin_unlock queued_spin_unlock
-#define vcpu_is_preempted vcpu_is_preempted
static inline bool vcpu_is_preempted(long cpu)
{
return pv_vcpu_is_preempted(cpu);
}
+#define vcpu_is_preempted vcpu_is_preempted
#endif
#ifdef CONFIG_PARAVIRT
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 7ca2d46c08cc..f03e2962afa8 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -384,7 +384,7 @@ struct paravirt_patch_template pv_ops = {
#if defined(CONFIG_PARAVIRT_SPINLOCKS)
/* Lock ops. */
#ifdef CONFIG_SMP
- .lock.queued_spin_lock_slowpath = native_queued_spin_lock_slowpath,
+ .lock.queued_spin_lock_slowpath = queued_spin_lock_slowpath,
.lock.queued_spin_unlock =
PV_CALLEE_SAVE(__native_queued_spin_unlock),
.lock.wait = paravirt_nop,
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 4045b5683ecb..412b83040bac 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -295,8 +295,7 @@ static __always_inline void set_locked(struct qspinlock *lock)
* pv_kick(cpu) -- wakes a suspended vcpu
*
* Using these we implement __pv_queued_spin_lock_slowpath() and
- * __pv_queued_spin_unlock() to replace native_queued_spin_lock_slowpath() and
- * native_queued_spin_unlock().
+ * __pv_queued_spin_unlock().
*/
#define _Q_SLOW_VAL (3U << _Q_LOCKED_OFFSET)
@@ -986,10 +985,6 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool parav
* contended : (*,x,y) +--> (*,0,0) ---> (*,0,1) -' :
* queue : ^--' :
*/
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define queued_spin_lock_slowpath native_queued_spin_lock_slowpath
-#endif
-
void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
if (virt_spin_lock(lock))
@@ -1070,7 +1065,6 @@ void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
EXPORT_SYMBOL(queued_spin_lock_slowpath);
#ifdef CONFIG_PARAVIRT_SPINLOCKS
-#undef queued_spin_lock_slowpath
void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
queued_spin_lock_mcs_queue(lock, true);
--
2.35.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 08/13] locking/qspinlock: stop renaming queued_spin_lock_slowpath to native_queued_spin_lock_slowpath
2022-07-04 14:38 ` [PATCH 08/13] locking/qspinlock: stop renaming queued_spin_lock_slowpath to native_queued_spin_lock_slowpath Nicholas Piggin
@ 2022-07-05 17:28 ` Peter Zijlstra
0 siblings, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2022-07-05 17:28 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, linux-kernel
On Tue, Jul 05, 2022 at 12:38:15AM +1000, Nicholas Piggin wrote:
> The native version can simply be queued_spin_lock_slowpath, and the
> paravirt version __pv_queued_spin_lock_slowpath, which is as they are
> named in the C code.
>
Humm... so it is the x86 paravirt convention to have native_*()
functions for everything, which sometimes are unconditionally called
whenever the paravirt indirection is unwarranted etc..
I suppose that's not the case with the spinlock code though, so yeah,
why not.
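(For context, the convention being referred to is roughly this shape,
paraphrased from the arch/x86 headers rather than quoted verbatim: a
native_*() implementation that is also the default entry in pv_ops.lock,
plus a pv_*() wrapper that dispatches through the paravirt indirection, so
callers that know paravirt is off can call native_*() directly:)
/* native implementation; also the default .lock.queued_spin_lock_slowpath */
void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
/* paravirt wrapper: indirect call through pv_ops.lock */
static __always_inline void
pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
	PVOP_VCALL2(lock.queued_spin_lock_slowpath, lock, val);
}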
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 09/13] locking/qspinlock: rename __pv_init_lock_hash to pv_spinlocks_init
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
` (7 preceding siblings ...)
2022-07-04 14:38 ` [PATCH 08/13] locking/qspinlock: stop renaming queued_spin_lock_slowpath to native_queued_spin_lock_slowpath Nicholas Piggin
@ 2022-07-04 14:38 ` Nicholas Piggin
2022-07-04 14:38 ` [PATCH 10/13] locking/qspinlock: paravirt use simple trylock in case idx overflows Nicholas Piggin
` (4 subsequent siblings)
13 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Waiman Long,
Boqun Feng, linux-kernel
The caller should not have to be aware of what the implementation's
initialisation does.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
arch/powerpc/include/asm/qspinlock.h | 7 -------
arch/powerpc/include/asm/spinlock.h | 2 +-
arch/x86/hyperv/hv_spinlock.c | 2 +-
arch/x86/include/asm/qspinlock.h | 1 -
arch/x86/kernel/kvm.c | 2 +-
arch/x86/xen/spinlock.c | 2 +-
include/asm-generic/qspinlock.h | 6 ++++++
kernel/locking/qspinlock.c | 2 +-
8 files changed, 11 insertions(+), 13 deletions(-)
diff --git a/arch/powerpc/include/asm/qspinlock.h b/arch/powerpc/include/asm/qspinlock.h
index dd231c756233..39c1c7f80579 100644
--- a/arch/powerpc/include/asm/qspinlock.h
+++ b/arch/powerpc/include/asm/qspinlock.h
@@ -53,13 +53,6 @@ static __always_inline void pv_kick(int cpu)
prod_cpu(cpu);
}
-extern void __pv_init_lock_hash(void);
-
-static inline void pv_spinlocks_init(void)
-{
- __pv_init_lock_hash();
-}
-
#endif
/*
diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index bd75872a6334..7dafca8e3f02 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -13,7 +13,7 @@
/* See include/linux/spinlock.h */
#define smp_mb__after_spinlock() smp_mb()
-#ifndef CONFIG_PARAVIRT_SPINLOCKS
+#ifndef CONFIG_PPC_QUEUED_SPINLOCKS
static inline void pv_spinlocks_init(void) { }
#endif
diff --git a/arch/x86/hyperv/hv_spinlock.c b/arch/x86/hyperv/hv_spinlock.c
index 91cfe698bde0..c7b5c3211c79 100644
--- a/arch/x86/hyperv/hv_spinlock.c
+++ b/arch/x86/hyperv/hv_spinlock.c
@@ -76,7 +76,7 @@ void __init hv_init_spinlocks(void)
}
pr_info("PV spinlocks enabled\n");
- __pv_init_lock_hash();
+ pv_spinlocks_init();
pv_ops.lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
pv_ops.lock.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_ops.lock.wait = hv_qlock_wait;
diff --git a/arch/x86/include/asm/qspinlock.h b/arch/x86/include/asm/qspinlock.h
index 603ad61e9dfe..9a03fcc1b2b7 100644
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -29,7 +29,6 @@ static __always_inline u32 queued_fetch_set_pending_acquire(struct qspinlock *lo
#ifdef CONFIG_PARAVIRT_SPINLOCKS
extern void queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
-extern void __pv_init_lock_hash(void);
extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
extern bool nopvspin;
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 1a3658f7e6d9..98a2c4d3e91d 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -1106,7 +1106,7 @@ void __init kvm_spinlock_init(void)
pr_info("PV spinlocks enabled\n");
- __pv_init_lock_hash();
+ pv_spinlocks_init();
pv_ops.lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
pv_ops.lock.queued_spin_unlock =
PV_CALLEE_SAVE(__pv_queued_spin_unlock);
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 043c73dfd2c9..5145c4aec4ea 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -135,7 +135,7 @@ void __init xen_init_spinlocks(void)
}
printk(KERN_DEBUG "xen: PV spinlocks enabled\n");
- __pv_init_lock_hash();
+ pv_spinlocks_init();
pv_ops.lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
pv_ops.lock.queued_spin_unlock =
PV_CALLEE_SAVE(__pv_queued_spin_unlock);
diff --git a/include/asm-generic/qspinlock.h b/include/asm-generic/qspinlock.h
index 995513fa2690..e0fb29ee1adc 100644
--- a/include/asm-generic/qspinlock.h
+++ b/include/asm-generic/qspinlock.h
@@ -147,4 +147,10 @@ static __always_inline bool virt_spin_lock(struct qspinlock *lock)
#define arch_spin_trylock(l) queued_spin_trylock(l)
#define arch_spin_unlock(l) queued_spin_unlock(l)
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void pv_spinlocks_init(void);
+#else
+static inline void pv_spinlocks_init(void) { }
+#endif
+
#endif /* __ASM_GENERIC_QSPINLOCK_H */
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 412b83040bac..1d5b3443772c 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -403,7 +403,7 @@ static unsigned int pv_lock_hash_bits __read_mostly;
* This function should be called from the paravirt spinlock initialization
* routine.
*/
-void __init __pv_init_lock_hash(void)
+void __init pv_spinlocks_init(void)
{
int pv_hash_size = ALIGN(4 * num_possible_cpus(), PV_HE_PER_LINE);
--
2.35.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 10/13] locking/qspinlock: paravirt use simple trylock in case idx overflows
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
` (8 preceding siblings ...)
2022-07-04 14:38 ` [PATCH 09/13] locking/qspinlock: rename __pv_init_lock_hash to pv_spinlocks_init Nicholas Piggin
@ 2022-07-04 14:38 ` Nicholas Piggin
2022-07-04 14:38 ` [PATCH 11/13] locking/qspinlock: Use queued_spin_trylock in pv_hybrid_queued_unfair_trylock Nicholas Piggin
` (3 subsequent siblings)
13 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Waiman Long,
Boqun Feng, linux-kernel
The mcs node overflow fallback locking case does not have to use
pv_hybrid_queued_unfair_trylock as the trylock, which reduces the
differences between pv and !pv cases.
This was likely an artifact of function renaming making it use the pv
trylock, which is not a bug but it is unexpected.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
kernel/locking/qspinlock.c | 9 ++-------
1 file changed, 2 insertions(+), 7 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 1d5b3443772c..cef0ca7d94e1 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -798,13 +798,8 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool parav
*/
if (unlikely(idx >= MAX_NODES)) {
lockevent_inc(lock_no_node);
- if (paravirt) {
- while (!pv_hybrid_queued_unfair_trylock(lock))
- cpu_relax();
- } else {
- while (!queued_spin_trylock(lock))
- cpu_relax();
- }
+ while (!queued_spin_trylock(lock))
+ cpu_relax();
goto release;
}
--
2.35.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* [PATCH 11/13] locking/qspinlock: Use queued_spin_trylock in pv_hybrid_queued_unfair_trylock
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
` (9 preceding siblings ...)
2022-07-04 14:38 ` [PATCH 10/13] locking/qspinlock: paravirt use simple trylock in case idx overflows Nicholas Piggin
@ 2022-07-04 14:38 ` Nicholas Piggin
2022-07-05 17:31 ` Peter Zijlstra
2022-07-05 20:15 ` Waiman Long
2022-07-04 14:38 ` [PATCH 12/13] locking/qspinlock: separate pv_wait_node from the non-paravirt path Nicholas Piggin
` (2 subsequent siblings)
13 siblings, 2 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Waiman Long,
Boqun Feng, linux-kernel
Rather than open-code it as necessitated by the old function-renaming
code generation that rendered queued_spin_trylock unavailable to use
here.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
kernel/locking/qspinlock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index cef0ca7d94e1..9db168753124 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -357,7 +357,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
int val = atomic_read(&lock->val);
if (!(val & _Q_LOCKED_PENDING_MASK) &&
- (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
+ queued_spin_trylock(lock)) {
lockevent_inc(pv_lock_stealing);
return true;
}
--
2.35.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 11/13] locking/qspinlock: Use queued_spin_trylock in pv_hybrid_queued_unfair_trylock
2022-07-04 14:38 ` [PATCH 11/13] locking/qspinlock: Use queued_spin_trylock in pv_hybrid_queued_unfair_trylock Nicholas Piggin
@ 2022-07-05 17:31 ` Peter Zijlstra
2022-07-05 20:15 ` Waiman Long
1 sibling, 0 replies; 38+ messages in thread
From: Peter Zijlstra @ 2022-07-05 17:31 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, linux-kernel
On Tue, Jul 05, 2022 at 12:38:18AM +1000, Nicholas Piggin wrote:
> Rather than open-code it as necessitated by the old function-renaming
> code generation that rendered queued_spin_trylock unavailable to use
> here.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> kernel/locking/qspinlock.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index cef0ca7d94e1..9db168753124 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -357,7 +357,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
> int val = atomic_read(&lock->val);
>
> if (!(val & _Q_LOCKED_PENDING_MASK) &&
> - (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
> + queued_spin_trylock(lock)) {
Indenting went wild here; please use cino=(0:0 (if you're a vim user; I
seem to have misplaced the emacs equivalent).
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 11/13] locking/qspinlock: Use queued_spin_trylock in pv_hybrid_queued_unfair_trylock
2022-07-04 14:38 ` [PATCH 11/13] locking/qspinlock: Use queued_spin_trylock in pv_hybrid_queued_unfair_trylock Nicholas Piggin
2022-07-05 17:31 ` Peter Zijlstra
@ 2022-07-05 20:15 ` Waiman Long
2022-07-12 0:48 ` Nicholas Piggin
1 sibling, 1 reply; 38+ messages in thread
From: Waiman Long @ 2022-07-05 20:15 UTC (permalink / raw)
To: Nicholas Piggin, Peter Zijlstra
Cc: Ingo Molnar, Will Deacon, Boqun Feng, linux-kernel
On 7/4/22 10:38, Nicholas Piggin wrote:
> Rather than open-code it as necessitated by the old function-renaming
> code generation that rendered queued_spin_trylock unavailable to use
> here.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> kernel/locking/qspinlock.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index cef0ca7d94e1..9db168753124 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -357,7 +357,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
> int val = atomic_read(&lock->val);
>
> if (!(val & _Q_LOCKED_PENDING_MASK) &&
> - (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
> + queued_spin_trylock(lock)) {
> lockevent_inc(pv_lock_stealing);
> return true;
> }
I am not sure if the compiler will eliminate the duplicated
atomic_read() in queued_spin_trylock(). So unless it can generate the
same code, I would prefer to leave this alone.
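(To spell out the concern: the generic queued_spin_trylock() in
include/asm-generic/qspinlock.h starts with its own atomic_read() of
lock->val — roughly the following, quoted from memory — so the hybrid
trylock would end up doing two reads of the lock word back to back unless
the compiler can merge them.)
static __always_inline int queued_spin_trylock(struct qspinlock *lock)
{
	int val = atomic_read(&lock->val);	/* second read of lock->val */
	if (unlikely(val))
		return 0;
	return likely(atomic_try_cmpxchg_acquire(&lock->val, &val, _Q_LOCKED_VAL));
}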
Cheers,
Longman
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 11/13] locking/qspinlock: Use queued_spin_trylock in pv_hybrid_queued_unfair_trylock
2022-07-05 20:15 ` Waiman Long
@ 2022-07-12 0:48 ` Nicholas Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-12 0:48 UTC (permalink / raw)
To: Waiman Long, Peter Zijlstra
Cc: Boqun Feng, linux-kernel, Ingo Molnar, Will Deacon
Excerpts from Waiman Long's message of July 6, 2022 6:15 am:
> On 7/4/22 10:38, Nicholas Piggin wrote:
>> Rather than open-code it as necessitated by the old function-renaming
>> code generation that rendered queued_spin_trylock unavailable to use
>> here.
>>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> kernel/locking/qspinlock.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index cef0ca7d94e1..9db168753124 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -357,7 +357,7 @@ static inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock)
>> int val = atomic_read(&lock->val);
>>
>> if (!(val & _Q_LOCKED_PENDING_MASK) &&
>> - (cmpxchg_acquire(&lock->locked, 0, _Q_LOCKED_VAL) == 0)) {
>> + queued_spin_trylock(lock)) {
>> lockevent_inc(pv_lock_stealing);
>> return true;
>> }
>
> I am not sure if the compiler will eliminate the duplicated
> atomic_read() in queued_spin_trylock(). So unless it can generate the
> same code, I would prefer to leave this alone.
Ah you're right, I had that read removed in my tree but then dropped
that change before submitting. This should have been dropped as well.
Thanks,
Nick
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 12/13] locking/qspinlock: separate pv_wait_node from the non-paravirt path
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
` (10 preceding siblings ...)
2022-07-04 14:38 ` [PATCH 11/13] locking/qspinlock: Use queued_spin_trylock in pv_hybrid_queued_unfair_trylock Nicholas Piggin
@ 2022-07-04 14:38 ` Nicholas Piggin
2022-07-05 17:34 ` Peter Zijlstra
2022-07-04 14:38 ` [PATCH 13/13] locking/qspinlock: simplify pv_wait_head_or_lock calling scheme Nicholas Piggin
2022-07-05 17:59 ` [PATCH 00/13] locking/qspinlock: simplify code generation Peter Zijlstra
13 siblings, 1 reply; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Waiman Long,
Boqun Feng, linux-kernel
pv_wait_node waits until node->locked is non-zero, no need for the
pv case to wait again by also executing the !pv code path.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
kernel/locking/qspinlock.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 9db168753124..19e2f286be0a 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -862,10 +862,11 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool parav
/* Link @node into the waitqueue. */
WRITE_ONCE(prev->next, node);
+ /* Wait for mcs node lock to be released */
if (paravirt)
pv_wait_node(node, prev);
- /* Wait for mcs node lock to be released */
- smp_cond_load_acquire(&node->locked, VAL);
+ else
+ smp_cond_load_acquire(&node->locked, VAL);
/*
* While waiting for the MCS lock, the next pointer may have
--
2.35.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 12/13] locking/qspinlock: separate pv_wait_node from the non-paravirt path
2022-07-04 14:38 ` [PATCH 12/13] locking/qspinlock: separate pv_wait_node from the non-paravirt path Nicholas Piggin
@ 2022-07-05 17:34 ` Peter Zijlstra
2022-07-12 0:50 ` Nicholas Piggin
0 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2022-07-05 17:34 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, linux-kernel
On Tue, Jul 05, 2022 at 12:38:19AM +1000, Nicholas Piggin wrote:
> pv_wait_node waits until node->locked is non-zero, no need for the
> pv case to wait again by also executing the !pv code path.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> kernel/locking/qspinlock.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
> index 9db168753124..19e2f286be0a 100644
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -862,10 +862,11 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool parav
> /* Link @node into the waitqueue. */
> WRITE_ONCE(prev->next, node);
>
> + /* Wait for mcs node lock to be released */
> if (paravirt)
> pv_wait_node(node, prev);
> - /* Wait for mcs node lock to be released */
> - smp_cond_load_acquire(&node->locked, VAL);
> + else
> + smp_cond_load_acquire(&node->locked, VAL);
>
(from patch #6):
+static void pv_wait_node(struct qnode *node, struct qnode *prev)
+{
+ int loop;
+ bool wait_early;
+
...
+
+ /*
+ * By now our node->locked should be 1 and our caller will not actually
+ * spin-wait for it. We do however rely on our caller to do a
+ * load-acquire for us.
+ */
+}
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 12/13] locking/qspinlock: separate pv_wait_node from the non-paravirt path
2022-07-05 17:34 ` Peter Zijlstra
@ 2022-07-12 0:50 ` Nicholas Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-12 0:50 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Boqun Feng, linux-kernel, Waiman Long, Ingo Molnar, Will Deacon
Excerpts from Peter Zijlstra's message of July 6, 2022 3:34 am:
> On Tue, Jul 05, 2022 at 12:38:19AM +1000, Nicholas Piggin wrote:
>> pv_wait_node waits until node->locked is non-zero, no need for the
>> pv case to wait again by also executing the !pv code path.
>>
>> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
>> ---
>> kernel/locking/qspinlock.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
>> index 9db168753124..19e2f286be0a 100644
>> --- a/kernel/locking/qspinlock.c
>> +++ b/kernel/locking/qspinlock.c
>> @@ -862,10 +862,11 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool parav
>> /* Link @node into the waitqueue. */
>> WRITE_ONCE(prev->next, node);
>>
>> + /* Wait for mcs node lock to be released */
>> if (paravirt)
>> pv_wait_node(node, prev);
>> - /* Wait for mcs node lock to be released */
>> - smp_cond_load_acquire(&node->locked, VAL);
>> + else
>> + smp_cond_load_acquire(&node->locked, VAL);
>>
>
> (from patch #6):
>
> +static void pv_wait_node(struct qnode *node, struct qnode *prev)
> +{
> + int loop;
> + bool wait_early;
> +
> ...
> +
> + /*
> + * By now our node->locked should be 1 and our caller will not actually
> + * spin-wait for it. We do however rely on our caller to do a
> + * load-acquire for us.
> + */
> +}
>
>
Oh good catch, thanks so much, that's a dumb bug. I'll add a
smp_load_acquire at the end of pv_wait_node, where that comment
is.
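(I.e. something along these lines at the tail of pv_wait_node — a sketch of
the intended follow-up, not a patch that has been posted or tested:)
	/*
	 * By now our node->locked should be 1; the caller no longer
	 * spin-waits for it in the pv case, so do the load-acquire
	 * here to order against the lock holder's release.
	 */
	smp_load_acquire(&node->locked);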
Thanks,
Nick
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 13/13] locking/qspinlock: simplify pv_wait_head_or_lock calling scheme
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
` (11 preceding siblings ...)
2022-07-04 14:38 ` [PATCH 12/13] locking/qspinlock: separate pv_wait_node from the non-paravirt path Nicholas Piggin
@ 2022-07-04 14:38 ` Nicholas Piggin
2022-07-05 17:59 ` [PATCH 00/13] locking/qspinlock: simplify code generation Peter Zijlstra
13 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-04 14:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Nicholas Piggin, Ingo Molnar, Will Deacon, Waiman Long,
Boqun Feng, linux-kernel
pv_wait_head_or_lock returns the lock word value ORed with a constant,
which was done to achieve a constant folding compiler optimisation
when the code was generated for both pv and !pv cases. This is no
longer necessary with the explicit paravirt test, so make the calling
convention simpler.
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
kernel/locking/qspinlock.c | 37 +++++++++++++++----------------------
1 file changed, 15 insertions(+), 22 deletions(-)
diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index 19e2f286be0a..97f95bedfa66 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -609,8 +609,7 @@ static void pv_kick_node(struct qspinlock *lock, struct qnode *node)
*
* The current value of the lock will be returned for additional processing.
*/
-static u32
-pv_wait_head_or_lock(struct qspinlock *lock, struct qnode *node)
+static void pv_wait_head_or_lock(struct qspinlock *lock, struct qnode *node)
{
struct qspinlock **lp = NULL;
int waitcnt = 0;
@@ -642,7 +641,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct qnode *node)
set_pending(lock);
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (trylock_clear_pending(lock))
- goto gotlock;
+ return; /* got lock */
cpu_relax();
}
clear_pending(lock);
@@ -670,7 +669,7 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct qnode *node)
*/
WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
WRITE_ONCE(*lp, NULL);
- goto gotlock;
+ return; /* got lock */
}
}
WRITE_ONCE(node->state, vcpu_hashed);
@@ -686,12 +685,8 @@ pv_wait_head_or_lock(struct qspinlock *lock, struct qnode *node)
/*
* The cmpxchg() or xchg() call before coming here provides the
- * acquire semantics for locking. The dummy ORing of _Q_LOCKED_VAL
- * here is to indicate to the compiler that the value will always
- * be nozero to enable better code optimization.
+ * acquire semantics for locking.
*/
-gotlock:
- return (u32)(atomic_read(&lock->val) | _Q_LOCKED_VAL);
}
/*
@@ -767,9 +762,8 @@ static __always_inline void pv_wait_node(struct qnode *node,
struct qnode *prev) { }
static __always_inline void pv_kick_node(struct qspinlock *lock,
struct qnode *node) { }
-static __always_inline u32 pv_wait_head_or_lock(struct qspinlock *lock,
- struct qnode *node)
- { return 0; }
+static __always_inline void pv_wait_head_or_lock(struct qspinlock *lock,
+ struct qnode *node) { }
static __always_inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock) { BUILD_BUG(); }
#endif /* CONFIG_PARAVIRT_SPINLOCKS */
@@ -890,24 +884,23 @@ static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool parav
* sequentiality; this is because the set_locked() function below
* does not imply a full barrier.
*
- * The PV pv_wait_head_or_lock function, if active, will acquire
- * the lock and return a non-zero value. So we have to skip the
- * atomic_cond_read_acquire() call. As the next PV queue head hasn't
- * been designated yet, there is no way for the locked value to become
- * _Q_SLOW_VAL. So both the set_locked() and the
+ * The PV pv_wait_head_or_lock function will acquire the lock, so
+ * skip the atomic_cond_read_acquire() call. As the next PV queue head
+ * hasn't been designated yet, there is no way for the locked value to
+ * become _Q_SLOW_VAL. So both the set_locked() and the
* atomic_cmpxchg_relaxed() calls will be safe.
*
* If PV isn't active, 0 will be returned instead.
*
*/
if (paravirt) {
- if ((val = pv_wait_head_or_lock(lock, node)))
- goto locked;
+ pv_wait_head_or_lock(lock, node);
+ val = atomic_read(&lock->val);
+ } else {
+ val = atomic_cond_read_acquire(&lock->val,
+ !(VAL & _Q_LOCKED_PENDING_MASK));
}
- val = atomic_cond_read_acquire(&lock->val, !(VAL & _Q_LOCKED_PENDING_MASK));
-
-locked:
/*
* claim the lock:
*
--
2.35.1
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 00/13] locking/qspinlock: simplify code generation
2022-07-04 14:38 [PATCH 00/13] locking/qspinlock: simplify code generation Nicholas Piggin
` (12 preceding siblings ...)
2022-07-04 14:38 ` [PATCH 13/13] locking/qspinlock: simplify pv_wait_head_or_lock calling scheme Nicholas Piggin
@ 2022-07-05 17:59 ` Peter Zijlstra
2022-07-12 0:56 ` Nicholas Piggin
13 siblings, 1 reply; 38+ messages in thread
From: Peter Zijlstra @ 2022-07-05 17:59 UTC (permalink / raw)
To: Nicholas Piggin
Cc: Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng, linux-kernel
On Tue, Jul 05, 2022 at 12:38:07AM +1000, Nicholas Piggin wrote:
> Hi,
>
> Been recently looking a bit closer at queued spinlock code, and
> found it's a little tricky to follow especially the pv generation.
> This series tries to improve the situation. It's not well tested
> outside powerpc, but it's really the x86 pv code that is the
> other major complexity that should need some review and testing.
> Opinions?
perhaps something like so on top/instead? This would still allow
slotting in other implementations with relative ease and the compilers
should constant fold all this.
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -609,7 +609,7 @@ static void pv_kick_node(struct qspinloc
*
* The current value of the lock will be returned for additional processing.
*/
-static void pv_wait_head_or_lock(struct qspinlock *lock, struct qnode *node)
+static u32 pv_wait_head_or_lock(struct qspinlock *lock, struct qnode *node)
{
struct qspinlock **lp = NULL;
int waitcnt = 0;
@@ -641,7 +641,7 @@ static void pv_wait_head_or_lock(struct
set_pending(lock);
for (loop = SPIN_THRESHOLD; loop; loop--) {
if (trylock_clear_pending(lock))
- return; /* got lock */
+ goto out; /* got lock */
cpu_relax();
}
clear_pending(lock);
@@ -669,7 +669,7 @@ static void pv_wait_head_or_lock(struct
*/
WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
WRITE_ONCE(*lp, NULL);
- return; /* got lock */
+ goto out; /* got lock */
}
}
WRITE_ONCE(node->state, vcpu_hashed);
@@ -683,12 +683,22 @@ static void pv_wait_head_or_lock(struct
*/
}
+out:
/*
* The cmpxchg() or xchg() call before coming here provides the
* acquire semantics for locking.
*/
+ return atomic_read(&lock->val);
}
+static const struct queue_ops pv_ops = {
+ .init_node = pv_init_node,
+ .trylock = pv_hybrid_queued_unfair_trylock,
+ .wait_node = pv_wait_node,
+ .wait_head_or_lock = pv_wait_head_or_lock,
+ .kick_node = pv_kick_node,
+};
+
/*
* PV versions of the unlock fastpath and slowpath functions to be used
* instead of queued_spin_unlock().
@@ -756,18 +766,18 @@ __visible void __pv_queued_spin_unlock(s
EXPORT_SYMBOL(__pv_queued_spin_unlock);
#endif
-#else /* CONFIG_PARAVIRT_SPINLOCKS */
-static __always_inline void pv_init_node(struct qnode *node) { }
-static __always_inline void pv_wait_node(struct qnode *node,
- struct qnode *prev) { }
-static __always_inline void pv_kick_node(struct qspinlock *lock,
- struct qnode *node) { }
-static __always_inline void pv_wait_head_or_lock(struct qspinlock *lock,
- struct qnode *node) { }
-static __always_inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock) { BUILD_BUG(); }
#endif /* CONFIG_PARAVIRT_SPINLOCKS */
-static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool paravirt)
+struct queue_ops {
+ void (*init_node)(struct qnode *node);
+ bool (*trylock)(struct qspinlock *lock);
+ void (*wait_node)(struct qnode *node, struct qnode *prev);
+ u32 (*wait_head_or_lock)(struct qspinlock *lock, struct qnode *node);
+ void (*kick_node)(struct qspinlock *lock, struct qnode *node);
+};
+
+static __always_inline
+void queued_spin_lock_mcs_queue(struct qspinlock *lock, const struct queue_ops *ops)
{
struct qnode *prev, *next, *node;
u32 val, old, tail;
@@ -813,16 +823,16 @@ static inline void queued_spin_lock_mcs_
node->locked = 0;
node->next = NULL;
- if (paravirt)
- pv_init_node(node);
+ if (ops && ops->init_node)
+ ops->init_node(node);
/*
* We touched a (possibly) cold cacheline in the per-cpu queue node;
* attempt the trylock once more in the hope someone let go while we
* weren't watching.
*/
- if (paravirt) {
- if (pv_hybrid_queued_unfair_trylock(lock))
+ if (ops && ops->trylock) {
+ if (ops->trylock(lock))
goto release;
} else {
if (queued_spin_trylock(lock))
@@ -857,8 +867,8 @@ static inline void queued_spin_lock_mcs_
WRITE_ONCE(prev->next, node);
/* Wait for mcs node lock to be released */
- if (paravirt)
- pv_wait_node(node, prev);
+ if (ops && ops->wait_node)
+ ops->wait_node(node, prev);
else
smp_cond_load_acquire(&node->locked, VAL);
@@ -893,12 +903,11 @@ static inline void queued_spin_lock_mcs_
* If PV isn't active, 0 will be returned instead.
*
*/
- if (paravirt) {
- pv_wait_head_or_lock(lock, node);
- val = atomic_read(&lock->val);
+ if (ops && ops->wait_head_or_lock) {
+ val = ops->wait_head_or_lock(lock, node);
} else {
val = atomic_cond_read_acquire(&lock->val,
- !(VAL & _Q_LOCKED_PENDING_MASK));
+ !(VAL & _Q_LOCKED_PENDING_MASK));
}
/*
@@ -1049,14 +1058,14 @@ void queued_spin_lock_slowpath(struct qs
*/
queue:
lockevent_inc(lock_slowpath);
- queued_spin_lock_mcs_queue(lock, false);
+ queued_spin_lock_mcs_queue(lock, NULL);
}
EXPORT_SYMBOL(queued_spin_lock_slowpath);
#ifdef CONFIG_PARAVIRT_SPINLOCKS
void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
{
- queued_spin_lock_mcs_queue(lock, true);
+ queued_spin_lock_mcs_queue(lock, &pv_ops);
}
EXPORT_SYMBOL(__pv_queued_spin_lock_slowpath);
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 00/13] locking/qspinlock: simplify code generation
2022-07-05 17:59 ` [PATCH 00/13] locking/qspinlock: simplify code generation Peter Zijlstra
@ 2022-07-12 0:56 ` Nicholas Piggin
0 siblings, 0 replies; 38+ messages in thread
From: Nicholas Piggin @ 2022-07-12 0:56 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Boqun Feng, linux-kernel, Waiman Long, Ingo Molnar, Will Deacon
Excerpts from Peter Zijlstra's message of July 6, 2022 3:59 am:
> On Tue, Jul 05, 2022 at 12:38:07AM +1000, Nicholas Piggin wrote:
>> Hi,
>>
>> Been recently looking a bit closer at queued spinlock code, and
>> found it's a little tricky to follow especially the pv generation.
>> This series tries to improve the situation. It's not well tested
>> outside powerpc, but it's really the x86 pv code that is the
>> other major complexity that should need some review and testing.
>> Opinions?
>
> perhaps something like so on top/instead? This would still allow
> slotting in other implementations with relative ease and the compilers
> should constant fold all this.
Yeah that could be a bit neater... I don't know. It all has to be
inlined and compiled together so it's a matter of taste in syntactic
sugar. Doing it with C is probably better than doing it with CPP,
all else being equal.
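(A userspace toy to illustrate the folding — hypothetical names, nothing
taken from the kernel sources: with forced inlining and a compile-time
constant ops pointer, the NULL checks and indirect calls are resolved at
build time, which is the property the proposal above relies on.)
#include <stdio.h>
struct ops {
	void (*init_node)(void);
};
static void pv_init_node_example(void)
{
	puts("pv init_node hook");
}
static const struct ops pv_example_ops = {
	.init_node = pv_init_node_example,
};
/* forced inline so each call site sees a constant 'ops' */
static inline __attribute__((always_inline))
void slowpath(const struct ops *ops)
{
	if (ops && ops->init_node)	/* constant-folded per call site */
		ops->init_node();
	puts("common slowpath body");
}
int main(void)
{
	slowpath(NULL);			/* hook test folds away entirely */
	slowpath(&pv_example_ops);	/* becomes a direct call to the hook */
	return 0;
}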
At the moment I'm not planning to replace the PV functions on powerpc
though. If/when it comes to that I'd say more changes would be needed.
Thanks,
Nick
>
> --- a/kernel/locking/qspinlock.c
> +++ b/kernel/locking/qspinlock.c
> @@ -609,7 +609,7 @@ static void pv_kick_node(struct qspinloc
> *
> * The current value of the lock will be returned for additional processing.
> */
> -static void pv_wait_head_or_lock(struct qspinlock *lock, struct qnode *node)
> +static u32 pv_wait_head_or_lock(struct qspinlock *lock, struct qnode *node)
> {
> struct qspinlock **lp = NULL;
> int waitcnt = 0;
> @@ -641,7 +641,7 @@ static void pv_wait_head_or_lock(struct
> set_pending(lock);
> for (loop = SPIN_THRESHOLD; loop; loop--) {
> if (trylock_clear_pending(lock))
> - return; /* got lock */
> + goto out; /* got lock */
> cpu_relax();
> }
> clear_pending(lock);
> @@ -669,7 +669,7 @@ static void pv_wait_head_or_lock(struct
> */
> WRITE_ONCE(lock->locked, _Q_LOCKED_VAL);
> WRITE_ONCE(*lp, NULL);
> - return; /* got lock */
> + goto out; /* got lock */
> }
> }
> WRITE_ONCE(node->state, vcpu_hashed);
> @@ -683,12 +683,22 @@ static void pv_wait_head_or_lock(struct
> */
> }
>
> +out:
> /*
> * The cmpxchg() or xchg() call before coming here provides the
> * acquire semantics for locking.
> */
> + return atomic_read(&lock->val);
> }
>
> +static const struct queue_ops pv_ops = {
> + .init_node = pv_init_node,
> + .trylock = pv_hybrid_queued_unfair_trylock,
> + .wait_node = pv_wait_node,
> + .wait_head_or_lock = pv_wait_head_or_lock,
> + .kick_node = pv_kick_node,
> +};
> +
> /*
> * PV versions of the unlock fastpath and slowpath functions to be used
> * instead of queued_spin_unlock().
> @@ -756,18 +766,18 @@ __visible void __pv_queued_spin_unlock(s
> EXPORT_SYMBOL(__pv_queued_spin_unlock);
> #endif
>
> -#else /* CONFIG_PARAVIRT_SPINLOCKS */
> -static __always_inline void pv_init_node(struct qnode *node) { }
> -static __always_inline void pv_wait_node(struct qnode *node,
> - struct qnode *prev) { }
> -static __always_inline void pv_kick_node(struct qspinlock *lock,
> - struct qnode *node) { }
> -static __always_inline void pv_wait_head_or_lock(struct qspinlock *lock,
> - struct qnode *node) { }
> -static __always_inline bool pv_hybrid_queued_unfair_trylock(struct qspinlock *lock) { BUILD_BUG(); }
> #endif /* CONFIG_PARAVIRT_SPINLOCKS */
>
> -static inline void queued_spin_lock_mcs_queue(struct qspinlock *lock, bool paravirt)
> +struct queue_ops {
> + void (*init_node)(struct qnode *node);
> + bool (*trylock)(struct qspinlock *lock);
> + void (*wait_node)(struct qnode *node, struct qnode *prev);
> + u32 (*wait_head_or_lock)(struct qspinlock *lock, struct qnode *node);
> + void (*kick_node)(struct qspinlock *lock, struct qnode *node);
> +};
> +
> +static __always_inline
> +void queued_spin_lock_mcs_queue(struct qspinlock *lock, const struct queue_ops *ops)
> {
> struct qnode *prev, *next, *node;
> u32 val, old, tail;
> @@ -813,16 +823,16 @@ static inline void queued_spin_lock_mcs_
>
> node->locked = 0;
> node->next = NULL;
> - if (paravirt)
> - pv_init_node(node);
> + if (ops && ops->init_node)
> + ops->init_node(node);
>
> /*
> * We touched a (possibly) cold cacheline in the per-cpu queue node;
> * attempt the trylock once more in the hope someone let go while we
> * weren't watching.
> */
> - if (paravirt) {
> - if (pv_hybrid_queued_unfair_trylock(lock))
> + if (ops && ops->trylock) {
> + if (ops->trylock(lock))
> goto release;
> } else {
> if (queued_spin_trylock(lock))
> @@ -857,8 +867,8 @@ static inline void queued_spin_lock_mcs_
> WRITE_ONCE(prev->next, node);
>
> /* Wait for mcs node lock to be released */
> - if (paravirt)
> - pv_wait_node(node, prev);
> + if (ops && ops->wait_node)
> + ops->wait_node(node, prev);
> else
> smp_cond_load_acquire(&node->locked, VAL);
>
> @@ -893,12 +903,11 @@ static inline void queued_spin_lock_mcs_
> * If PV isn't active, 0 will be returned instead.
> *
> */
> - if (paravirt) {
> - pv_wait_head_or_lock(lock, node);
> - val = atomic_read(&lock->val);
> + if (ops && ops->wait_head_or_lock) {
> + val = ops->wait_head_or_lock(lock, node);
> } else {
> val = atomic_cond_read_acquire(&lock->val,
> - !(VAL & _Q_LOCKED_PENDING_MASK));
> + !(VAL & _Q_LOCKED_PENDING_MASK));
> }
>
> /*
> @@ -1049,14 +1058,14 @@ void queued_spin_lock_slowpath(struct qs
> */
> queue:
> lockevent_inc(lock_slowpath);
> - queued_spin_lock_mcs_queue(lock, false);
> + queued_spin_lock_mcs_queue(lock, NULL);
> }
> EXPORT_SYMBOL(queued_spin_lock_slowpath);
>
> #ifdef CONFIG_PARAVIRT_SPINLOCKS
> void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val)
> {
> - queued_spin_lock_mcs_queue(lock, true);
> + queued_spin_lock_mcs_queue(lock, &pv_ops);
> }
> EXPORT_SYMBOL(__pv_queued_spin_lock_slowpath);
>
>
^ permalink raw reply [flat|nested] 38+ messages in thread