linux-arm-kernel.lists.infradead.org archive mirror
* [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops
@ 2025-08-16 15:19 Yeoreum Yun
  2025-08-16 15:19 ` [PATCH RESEND v7 1/6] arm64: cpufeature: add FEAT_LSUI Yeoreum Yun
                   ` (7 more replies)
  0 siblings, 8 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-08-16 15:19 UTC (permalink / raw)
  To: catalin.marinas, will, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland
  Cc: linux-arm-kernel, kvmarm, linux-kernel, Yeoreum Yun

Since Armv9.6, FEAT_LSUI supplies load/store instructions that allow
privileged code to access user memory without clearing the PSTATE.PAN
bit.

This patchset adds support for FEAT_LSUI and applies it to the futex
atomic operations, replacing the ldxr/stlxr implementation that clears
the PSTATE.PAN bit with the corresponding unprivileged load/store atomic
operations, which leave PSTATE.PAN set.
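
For FUTEX_OP_ADD, for example, the change boils down to something like
the sketch below (condensed from patch 6; names, instructions and asm
constraints follow the patches, slightly simplified here):

static __always_inline int
__lsui_futex_atomic_add(int oparg, u32 __user *uaddr, int *oval)
{
        int ret = 0;
        int oldval;

        uaccess_ttbr0_enable();                 /* PSTATE.PAN stays set */
        asm volatile("// __lsui_futex_atomic_add\n"
        ".arch_extension lsui\n"
"1:     ldtaddal        %w3, %w2, %1\n"
"2:\n"
        _ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w0)   /* fault: %w0 = -EFAULT */
        : "+r" (ret), "+Q" (*uaddr), "=r" (oldval)
        : "r" (oparg)
        : "memory");
        uaccess_ttbr0_disable();

        if (!ret)
                *oval = oldval;

        return ret;
}

This replaces a prfm/ldxr/add/stlxr retry loop that runs between
uaccess_enable_privileged() and uaccess_disable_privileged(), i.e. inside
a window where PSTATE.PAN is cleared.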

(Sorry, I sent patch version 7 incorrectly and am resending it.
 Apologies again for the mail bomb.)

Patch Sequences
================

Patch #1 adds the cpufeature for FEAT_LSUI

Patch #2 exposes FEAT_LSUI to guests

Patch #3 adds the Kconfig option for FEAT_LSUI

Patch #4 refactors the existing futex atomic-op implementation based on
         ll/sc and clearing PSTATE.PAN

Patch #5 applies a small optimisation to __llsc_futex_atomic_set().

Patch #6 supports the futex atomic ops with FEAT_LSUI

Patch History
==============
from v6 to v7:
  - wrap FEAT_LSUI with CONFIG_AS_HAS_LSUI in cpufeature
  - remove unnecessary addition of indentation.
  - remove unnecessary mte_tco_enable()/disable() on LSUI operation.
  - https://lore.kernel.org/all/20250811163635.1562145-1-yeoreum.yun@arm.com/

from v5 to v6:
  - rebase to v6.17-rc1
  - https://lore.kernel.org/all/20250722121956.1509403-1-yeoreum.yun@arm.com/

from v4 to v5:
  - remove futex_ll_sc.h, futex_lsui.h and lsui.h and move their contents to futex.h
  - reorganize the patches.
  - https://lore.kernel.org/all/20250721083618.2743569-1-yeoreum.yun@arm.com/

from v3 to v4:
  - rebase to v6.16-rc7
  - modify some patch titles.
  - https://lore.kernel.org/all/20250617183635.1266015-1-yeoreum.yun@arm.com/

from v2 to v3:
  - expose FEAT_LSUI to guest
  - add help section for LSUI Kconfig
  - https://lore.kernel.org/all/20250611151154.46362-1-yeoreum.yun@arm.com/

from v1 to v2:
  - remove empty v9.6 menu entry
  - place HAS_LSUI in cpucaps in the correct order
  - https://lore.kernel.org/all/20250611104916.10636-1-yeoreum.yun@arm.com/

Yeoreum Yun (6):
  arm64: cpufeature: add FEAT_LSUI
  KVM: arm64: expose FEAT_LSUI to guest
  arm64: Kconfig: add LSUI Kconfig
  arm64: futex: refactor futex atomic operation
  arm64: futex: small optimisation for __llsc_futex_atomic_set()
  arm64: futex: support futex with FEAT_LSUI

 arch/arm64/Kconfig             |   5 +
 arch/arm64/include/asm/futex.h | 291 +++++++++++++++++++++++++++------
 arch/arm64/kernel/cpufeature.c |  10 ++
 arch/arm64/kvm/sys_regs.c      |   5 +-
 arch/arm64/tools/cpucaps       |   1 +
 5 files changed, 261 insertions(+), 51 deletions(-)


base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH RESEND v7 1/6] arm64: cpufeature: add FEAT_LSUI
  2025-08-16 15:19 [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops Yeoreum Yun
@ 2025-08-16 15:19 ` Yeoreum Yun
  2025-09-12 16:12   ` Catalin Marinas
  2025-08-16 15:19 ` [PATCH RESEND v7 2/6] KVM: arm64: expose FEAT_LSUI to guest Yeoreum Yun
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 51+ messages in thread
From: Yeoreum Yun @ 2025-08-16 15:19 UTC (permalink / raw)
  To: catalin.marinas, will, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland
  Cc: linux-arm-kernel, kvmarm, linux-kernel, Yeoreum Yun

Since Armv9.6, FEAT_LSUI supplies load/store instructions that allow
privileged code to access user memory without clearing the PSTATE.PAN bit.

Add the LSUI CPU feature so that these unprivileged load/store
instructions can be used when the kernel accesses user memory, without
clearing the PSTATE.PAN bit.
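
For context, later in the series (patch 6) the capability is consumed
roughly as in the sketch below (simplified; the real code dispatches via
a __lsui_llsc_body() macro):

static __always_inline int
__futex_atomic_add(int oparg, u32 __user *uaddr, int *oval)
{
        /* use the LSUI implementation when the cpucap is present */
        if (alternative_has_cap_likely(ARM64_HAS_LSUI))
                return __lsui_futex_atomic_add(oparg, uaddr, oval);

        return __llsc_futex_atomic_add(oparg, uaddr, oval);
}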

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 arch/arm64/kernel/cpufeature.c | 10 ++++++++++
 arch/arm64/tools/cpucaps       |  1 +
 2 files changed, 11 insertions(+)

diff --git a/arch/arm64/kernel/cpufeature.c b/arch/arm64/kernel/cpufeature.c
index 9ad065f15f1d..b8660c8d51b2 100644
--- a/arch/arm64/kernel/cpufeature.c
+++ b/arch/arm64/kernel/cpufeature.c
@@ -278,6 +278,7 @@ static const struct arm64_ftr_bits ftr_id_aa64isar2[] = {

 static const struct arm64_ftr_bits ftr_id_aa64isar3[] = {
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_FPRCVT_SHIFT, 4, 0),
+	ARM64_FTR_BITS(FTR_HIDDEN, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_LSUI_SHIFT, 4, ID_AA64ISAR3_EL1_LSUI_NI),
 	ARM64_FTR_BITS(FTR_VISIBLE, FTR_NONSTRICT, FTR_LOWER_SAFE, ID_AA64ISAR3_EL1_FAMINMAX_SHIFT, 4, 0),
 	ARM64_FTR_END,
 };
@@ -3131,6 +3132,15 @@ static const struct arm64_cpu_capabilities arm64_features[] = {
 		.matches = has_cpuid_feature,
 		ARM64_CPUID_FIELDS(ID_AA64PFR2_EL1, GCIE, IMP)
 	},
+#ifdef CONFIG_AS_HAS_LSUI
+	{
+		.desc = "Unprivileged Load Store Instructions (LSUI)",
+		.capability = ARM64_HAS_LSUI,
+		.type = ARM64_CPUCAP_SYSTEM_FEATURE,
+		.matches = has_cpuid_feature,
+		ARM64_CPUID_FIELDS(ID_AA64ISAR3_EL1, LSUI, IMP)
+	},
+#endif
 	{},
 };

diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index ef0b7946f5a4..73f8e5211cd2 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -44,6 +44,7 @@ HAS_HCX
 HAS_LDAPR
 HAS_LPA2
 HAS_LSE_ATOMICS
+HAS_LSUI
 HAS_MOPS
 HAS_NESTED_VIRT
 HAS_BBML2_NOABORT
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH RESEND v7 2/6] KVM: arm64: expose FEAT_LSUI to guest
  2025-08-16 15:19 [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops Yeoreum Yun
  2025-08-16 15:19 ` [PATCH RESEND v7 1/6] arm64: cpufeature: add FEAT_LSUI Yeoreum Yun
@ 2025-08-16 15:19 ` Yeoreum Yun
  2025-09-12 16:25   ` Catalin Marinas
  2025-08-16 15:19 ` [PATCH RESEND v7 3/6] arm64: Kconfig: add LSUI Kconfig Yeoreum Yun
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 51+ messages in thread
From: Yeoreum Yun @ 2025-08-16 15:19 UTC (permalink / raw)
  To: catalin.marinas, will, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland
  Cc: linux-arm-kernel, kvmarm, linux-kernel, Yeoreum Yun

expose FEAT_LSUI to guest.

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
Acked-by: Marc Zyngier <maz@kernel.org>
---
 arch/arm64/kvm/sys_regs.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
index 82ffb3b3b3cf..fb6c154aa37d 100644
--- a/arch/arm64/kvm/sys_regs.c
+++ b/arch/arm64/kvm/sys_regs.c
@@ -1642,7 +1642,8 @@ static u64 __kvm_read_sanitised_id_reg(const struct kvm_vcpu *vcpu,
 			val &= ~ARM64_FEATURE_MASK(ID_AA64ISAR2_EL1_WFxT);
 		break;
 	case SYS_ID_AA64ISAR3_EL1:
-		val &= ID_AA64ISAR3_EL1_FPRCVT | ID_AA64ISAR3_EL1_FAMINMAX;
+		val &= ID_AA64ISAR3_EL1_FPRCVT | ID_AA64ISAR3_EL1_FAMINMAX |
+		       ID_AA64ISAR3_EL1_LSUI;
 		break;
 	case SYS_ID_AA64MMFR2_EL1:
 		val &= ~ID_AA64MMFR2_EL1_CCIDX_MASK;
@@ -2991,7 +2992,7 @@ static const struct sys_reg_desc sys_reg_descs[] = {
 					ID_AA64ISAR2_EL1_APA3 |
 					ID_AA64ISAR2_EL1_GPA3)),
 	ID_WRITABLE(ID_AA64ISAR3_EL1, (ID_AA64ISAR3_EL1_FPRCVT |
-				       ID_AA64ISAR3_EL1_FAMINMAX)),
+				       ID_AA64ISAR3_EL1_FAMINMAX | ID_AA64ISAR3_EL1_LSUI)),
 	ID_UNALLOCATED(6,4),
 	ID_UNALLOCATED(6,5),
 	ID_UNALLOCATED(6,6),
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH RESEND v7 3/6] arm64: Kconfig: add LSUI Kconfig
  2025-08-16 15:19 [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops Yeoreum Yun
  2025-08-16 15:19 ` [PATCH RESEND v7 1/6] arm64: cpufeature: add FEAT_LSUI Yeoreum Yun
  2025-08-16 15:19 ` [PATCH RESEND v7 2/6] KVM: arm64: expose FEAT_LSUI to guest Yeoreum Yun
@ 2025-08-16 15:19 ` Yeoreum Yun
  2025-09-12 16:24   ` Catalin Marinas
  2025-08-16 15:19 ` [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation Yeoreum Yun
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 51+ messages in thread
From: Yeoreum Yun @ 2025-08-16 15:19 UTC (permalink / raw)
  To: catalin.marinas, will, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland
  Cc: linux-arm-kernel, kvmarm, linux-kernel, Yeoreum Yun

Since Armv9.6, FEAT_LSUI supplies load/store instructions that allow
privileged code to access user memory without clearing the PSTATE.PAN
bit.
Adding CONFIG_AS_HAS_LSUI is sufficient because the LSUI code enables the
extension with individual `.arch_extension` directives.
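
For reference, the Kconfig option added below only probes whether the
assembler accepts the directive, i.e. roughly whether this assembles:

        .arch_extension lsui

Each LSUI asm block then enables the extension locally via a preamble
(defined in a later patch), so no global -march/-Wa option is needed:

        #define __LSUI_PREAMBLE ".arch_extension lsui\n"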

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 arch/arm64/Kconfig | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index e9bbfacc35a6..c474de3dce02 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2239,6 +2239,11 @@ config ARM64_GCS

 endmenu # "v9.4 architectural features"

+config AS_HAS_LSUI
+	def_bool $(as-instr,.arch_extension lsui)
+	help
+	 Supported by LLVM 20 and later, not yet supported by GNU AS.
+
 config ARM64_SVE
 	bool "ARM Scalable Vector Extension support"
 	default y
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-08-16 15:19 [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops Yeoreum Yun
                   ` (2 preceding siblings ...)
  2025-08-16 15:19 ` [PATCH RESEND v7 3/6] arm64: Kconfig: add LSUI Kconfig Yeoreum Yun
@ 2025-08-16 15:19 ` Yeoreum Yun
  2025-09-11 15:38   ` Will Deacon
                     ` (2 more replies)
  2025-08-16 15:19 ` [PATCH v7 RESEND 5/6] arm64: futex: small optimisation for __llsc_futex_atomic_set() Yeoreum Yun
                   ` (3 subsequent siblings)
  7 siblings, 3 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-08-16 15:19 UTC (permalink / raw)
  To: catalin.marinas, will, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland
  Cc: linux-arm-kernel, kvmarm, linux-kernel, Yeoreum Yun

Refactor the futex atomic operations, currently implemented with the
ll/sc method and by clearing PSTATE.PAN, to prepare for applying
FEAT_LSUI to them.

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 arch/arm64/include/asm/futex.h | 132 +++++++++++++++++++++------------
 1 file changed, 84 insertions(+), 48 deletions(-)

diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
index bc06691d2062..ab7003cb4724 100644
--- a/arch/arm64/include/asm/futex.h
+++ b/arch/arm64/include/asm/futex.h
@@ -7,17 +7,21 @@

 #include <linux/futex.h>
 #include <linux/uaccess.h>
+#include <linux/stringify.h>

 #include <asm/errno.h>

-#define FUTEX_MAX_LOOPS	128 /* What's the largest number you can think of? */
+#define LLSC_MAX_LOOPS	128 /* What's the largest number you can think of? */

-#define __futex_atomic_op(insn, ret, oldval, uaddr, tmp, oparg)		\
-do {									\
-	unsigned int loops = FUTEX_MAX_LOOPS;				\
+#define LLSC_FUTEX_ATOMIC_OP(op, insn)					\
+static __always_inline int						\
+__llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
+{									\
+	unsigned int loops = LLSC_MAX_LOOPS;				\
+	int ret, oldval, tmp;						\
 									\
 	uaccess_enable_privileged();					\
-	asm volatile(							\
+	asm volatile("// __llsc_futex_atomic_" #op "\n"			\
 "	prfm	pstl1strm, %2\n"					\
 "1:	ldxr	%w1, %2\n"						\
 	insn "\n"							\
@@ -35,45 +39,103 @@ do {									\
 	: "r" (oparg), "Ir" (-EAGAIN)					\
 	: "memory");							\
 	uaccess_disable_privileged();					\
-} while (0)
+									\
+	if (!ret)							\
+		*oval = oldval;						\
+									\
+	return ret;							\
+}
+
+LLSC_FUTEX_ATOMIC_OP(add, "add	%w3, %w1, %w5")
+LLSC_FUTEX_ATOMIC_OP(or, "orr	%w3, %w1, %w5")
+LLSC_FUTEX_ATOMIC_OP(and, "and	%w3, %w1, %w5")
+LLSC_FUTEX_ATOMIC_OP(eor, "eor	%w3, %w1, %w5")
+LLSC_FUTEX_ATOMIC_OP(set, "mov	%w3, %w5")
+
+static __always_inline int
+__llsc_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
+{
+	int ret = 0;
+	unsigned int loops = LLSC_MAX_LOOPS;
+	u32 val, tmp;
+
+	uaccess_enable_privileged();
+	asm volatile("//__llsc_futex_cmpxchg\n"
+"	prfm	pstl1strm, %2\n"
+"1:	ldxr	%w1, %2\n"
+"	eor	%w3, %w1, %w5\n"
+"	cbnz	%w3, 4f\n"
+"2:	stlxr	%w3, %w6, %2\n"
+"	cbz	%w3, 3f\n"
+"	sub	%w4, %w4, %w3\n"
+"	cbnz	%w4, 1b\n"
+"	mov	%w0, %w7\n"
+"3:\n"
+"	dmb	ish\n"
+"4:\n"
+	_ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)
+	_ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)
+	: "+r" (ret), "=&r" (val), "+Q" (*uaddr), "=&r" (tmp), "+r" (loops)
+	: "r" (oldval), "r" (newval), "Ir" (-EAGAIN)
+	: "memory");
+	uaccess_disable_privileged();
+
+	if (!ret)
+		*oval = val;
+
+	return ret;
+}
+
+#define FUTEX_ATOMIC_OP(op)						\
+static __always_inline int						\
+__futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)		\
+{									\
+	return __llsc_futex_atomic_##op(oparg, uaddr, oval);		\
+}
+
+FUTEX_ATOMIC_OP(add)
+FUTEX_ATOMIC_OP(or)
+FUTEX_ATOMIC_OP(and)
+FUTEX_ATOMIC_OP(eor)
+FUTEX_ATOMIC_OP(set)
+
+static __always_inline int
+__futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
+{
+	return __llsc_futex_cmpxchg(uaddr, oldval, newval, oval);
+}

 static inline int
 arch_futex_atomic_op_inuser(int op, int oparg, int *oval, u32 __user *_uaddr)
 {
-	int oldval = 0, ret, tmp;
-	u32 __user *uaddr = __uaccess_mask_ptr(_uaddr);
+	int ret;
+	u32 __user *uaddr;

 	if (!access_ok(_uaddr, sizeof(u32)))
 		return -EFAULT;

+	uaddr = __uaccess_mask_ptr(_uaddr);
+
 	switch (op) {
 	case FUTEX_OP_SET:
-		__futex_atomic_op("mov	%w3, %w5",
-				  ret, oldval, uaddr, tmp, oparg);
+		ret = __futex_atomic_set(oparg, uaddr, oval);
 		break;
 	case FUTEX_OP_ADD:
-		__futex_atomic_op("add	%w3, %w1, %w5",
-				  ret, oldval, uaddr, tmp, oparg);
+		ret = __futex_atomic_add(oparg, uaddr, oval);
 		break;
 	case FUTEX_OP_OR:
-		__futex_atomic_op("orr	%w3, %w1, %w5",
-				  ret, oldval, uaddr, tmp, oparg);
+		ret = __futex_atomic_or(oparg, uaddr, oval);
 		break;
 	case FUTEX_OP_ANDN:
-		__futex_atomic_op("and	%w3, %w1, %w5",
-				  ret, oldval, uaddr, tmp, ~oparg);
+		ret = __futex_atomic_and(~oparg, uaddr, oval);
 		break;
 	case FUTEX_OP_XOR:
-		__futex_atomic_op("eor	%w3, %w1, %w5",
-				  ret, oldval, uaddr, tmp, oparg);
+		ret = __futex_atomic_eor(oparg, uaddr, oval);
 		break;
 	default:
 		ret = -ENOSYS;
 	}

-	if (!ret)
-		*oval = oldval;
-
 	return ret;
 }

@@ -81,40 +143,14 @@ static inline int
 futex_atomic_cmpxchg_inatomic(u32 *uval, u32 __user *_uaddr,
 			      u32 oldval, u32 newval)
 {
-	int ret = 0;
-	unsigned int loops = FUTEX_MAX_LOOPS;
-	u32 val, tmp;
 	u32 __user *uaddr;

 	if (!access_ok(_uaddr, sizeof(u32)))
 		return -EFAULT;

 	uaddr = __uaccess_mask_ptr(_uaddr);
-	uaccess_enable_privileged();
-	asm volatile("// futex_atomic_cmpxchg_inatomic\n"
-"	prfm	pstl1strm, %2\n"
-"1:	ldxr	%w1, %2\n"
-"	sub	%w3, %w1, %w5\n"
-"	cbnz	%w3, 4f\n"
-"2:	stlxr	%w3, %w6, %2\n"
-"	cbz	%w3, 3f\n"
-"	sub	%w4, %w4, %w3\n"
-"	cbnz	%w4, 1b\n"
-"	mov	%w0, %w7\n"
-"3:\n"
-"	dmb	ish\n"
-"4:\n"
-	_ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)
-	_ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)
-	: "+r" (ret), "=&r" (val), "+Q" (*uaddr), "=&r" (tmp), "+r" (loops)
-	: "r" (oldval), "r" (newval), "Ir" (-EAGAIN)
-	: "memory");
-	uaccess_disable_privileged();
-
-	if (!ret)
-		*uval = val;

-	return ret;
+	return __futex_cmpxchg(uaddr, oldval, newval, uval);
 }

 #endif /* __ASM_FUTEX_H */
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH v7 RESEND 5/6] arm64: futex: small optimisation for __llsc_futex_atomic_set()
  2025-08-16 15:19 [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops Yeoreum Yun
                   ` (3 preceding siblings ...)
  2025-08-16 15:19 ` [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation Yeoreum Yun
@ 2025-08-16 15:19 ` Yeoreum Yun
  2025-09-11 15:28   ` Will Deacon
  2025-08-16 15:19 ` [PATCH RESEND v7 6/6] arm64: futex: support futex with FEAT_LSUI Yeoreum Yun
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 51+ messages in thread
From: Yeoreum Yun @ 2025-08-16 15:19 UTC (permalink / raw)
  To: catalin.marinas, will, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland
  Cc: linux-arm-kernel, kvmarm, linux-kernel, Yeoreum Yun

__llsc_futex_atomic_set() is implemented using the
LLSC_FUTEX_ATOMIC_OP() macro with "mov  %w3, %w5".
This instruction isn't required to implement futex_atomic_set(),
so make a small optimisation by implementing __llsc_futex_atomic_set()
as a separate function.

This also makes the usage of the LLSC_FUTEX_ATOMIC_OP() macro simpler.

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 arch/arm64/include/asm/futex.h | 43 ++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 7 deletions(-)

diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
index ab7003cb4724..22a6301a9f3d 100644
--- a/arch/arm64/include/asm/futex.h
+++ b/arch/arm64/include/asm/futex.h
@@ -13,7 +13,7 @@

 #define LLSC_MAX_LOOPS	128 /* What's the largest number you can think of? */

-#define LLSC_FUTEX_ATOMIC_OP(op, insn)					\
+#define LLSC_FUTEX_ATOMIC_OP(op, asm_op)				\
 static __always_inline int						\
 __llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
 {									\
@@ -24,7 +24,7 @@ __llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
 	asm volatile("// __llsc_futex_atomic_" #op "\n"			\
 "	prfm	pstl1strm, %2\n"					\
 "1:	ldxr	%w1, %2\n"						\
-	insn "\n"							\
+"	" #asm_op "	%w3, %w1, %w5\n"				\
 "2:	stlxr	%w0, %w3, %2\n"						\
 "	cbz	%w0, 3f\n"						\
 "	sub	%w4, %w4, %w0\n"					\
@@ -46,11 +46,40 @@ __llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
 	return ret;							\
 }

-LLSC_FUTEX_ATOMIC_OP(add, "add	%w3, %w1, %w5")
-LLSC_FUTEX_ATOMIC_OP(or, "orr	%w3, %w1, %w5")
-LLSC_FUTEX_ATOMIC_OP(and, "and	%w3, %w1, %w5")
-LLSC_FUTEX_ATOMIC_OP(eor, "eor	%w3, %w1, %w5")
-LLSC_FUTEX_ATOMIC_OP(set, "mov	%w3, %w5")
+LLSC_FUTEX_ATOMIC_OP(add, add)
+LLSC_FUTEX_ATOMIC_OP(or, orr)
+LLSC_FUTEX_ATOMIC_OP(and, and)
+LLSC_FUTEX_ATOMIC_OP(eor, eor)
+
+static __always_inline int
+__llsc_futex_atomic_set(int oparg, u32 __user *uaddr, int *oval)
+{
+	unsigned int loops = LLSC_MAX_LOOPS;
+	int ret, oldval;
+
+	uaccess_enable_privileged();
+	asm volatile("//__llsc_futex_xchg\n"
+"	prfm	pstl1strm, %2\n"
+"1:	ldxr	%w1, %2\n"
+"2:	stlxr	%w0, %w4, %2\n"
+"	cbz	%w3, 3f\n"
+"	sub	%w3, %w3, %w0\n"
+"	cbnz	%w3, 1b\n"
+"	mov	%w0, %w5\n"
+"3:\n"
+"	dmb	ish\n"
+	_ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w0)
+	_ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w0)
+	: "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "+r" (loops)
+	: "r" (oparg), "Ir" (-EAGAIN)
+	: "memory");
+	uaccess_disable_privileged();
+
+	if (!ret)
+		*oval = oldval;
+
+	return ret;
+}

 static __always_inline int
 __llsc_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH RESEND v7 6/6] arm64: futex: support futex with FEAT_LSUI
  2025-08-16 15:19 [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops Yeoreum Yun
                   ` (4 preceding siblings ...)
  2025-08-16 15:19 ` [PATCH v7 RESEND 5/6] arm64: futex: small optimisation for __llsc_futex_atomic_set() Yeoreum Yun
@ 2025-08-16 15:19 ` Yeoreum Yun
  2025-09-11 15:22   ` Will Deacon
  2025-09-12 17:09   ` Catalin Marinas
  2025-09-01 10:06 ` [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops Yeoreum Yun
  2025-09-11 15:09 ` Will Deacon
  7 siblings, 2 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-08-16 15:19 UTC (permalink / raw)
  To: catalin.marinas, will, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland
  Cc: linux-arm-kernel, kvmarm, linux-kernel, Yeoreum Yun

Current futex atomic operations are implemented with ll/sc instructions
and by clearing PSTATE.PAN.

Since Armv9.6, FEAT_LSUI supplies not only unprivileged load/store
instructions but also atomic operations for accessing user memory from
the kernel, so the PSTATE.PAN bit no longer needs to be cleared.

With these instructions, some of the futex atomic operations no longer
need to be implemented with an ldxr/stlxr pair; they can instead be
implemented with a single atomic operation supplied by FEAT_LSUI.

However, some futex atomic operations (i.e. eor and cmpxchg) still need
the ll/sc approach, via the ldtxr/stltxr instructions supplied by
FEAT_LSUI, because there is no corresponding atomic instruction or
because cas{al}t doesn't support word-size operands.

Even so, these now work without clearing the PSTATE.PAN bit.

Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
---
 arch/arm64/include/asm/futex.h | 130 ++++++++++++++++++++++++++++++++-
 1 file changed, 129 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
index 22a6301a9f3d..ece35ca9b5d9 100644
--- a/arch/arm64/include/asm/futex.h
+++ b/arch/arm64/include/asm/futex.h
@@ -9,6 +9,8 @@
 #include <linux/uaccess.h>
 #include <linux/stringify.h>

+#include <asm/alternative.h>
+#include <asm/alternative-macros.h>
 #include <asm/errno.h>

 #define LLSC_MAX_LOOPS	128 /* What's the largest number you can think of? */
@@ -115,11 +117,137 @@ __llsc_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
 	return ret;
 }

+#ifdef CONFIG_AS_HAS_LSUI
+
+#define __LSUI_PREAMBLE	".arch_extension lsui\n"
+
+#define LSUI_FUTEX_ATOMIC_OP(op, asm_op, mb)				\
+static __always_inline int						\
+__lsui_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
+{									\
+	int ret = 0;							\
+	int oldval;							\
+									\
+	uaccess_ttbr0_enable();						\
+	asm volatile("// __lsui_futex_atomic_" #op "\n"			\
+	__LSUI_PREAMBLE							\
+"1:	" #asm_op #mb "	%w3, %w2, %1\n"					\
+"2:\n"									\
+	_ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w0)				\
+	: "+r" (ret), "+Q" (*uaddr), "=r" (oldval)			\
+	: "r" (oparg)							\
+	: "memory");							\
+	uaccess_ttbr0_disable();					\
+									\
+	if (!ret)							\
+		*oval = oldval;						\
+									\
+	return ret;							\
+}
+
+LSUI_FUTEX_ATOMIC_OP(add, ldtadd, al)
+LSUI_FUTEX_ATOMIC_OP(or, ldtset, al)
+LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr, al)
+LSUI_FUTEX_ATOMIC_OP(set, swpt, al)
+
+static __always_inline int
+__lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
+{
+	return __lsui_futex_atomic_andnot(~oparg, uaddr, oval);
+}
+
+static __always_inline int
+__lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
+{
+	unsigned int loops = LLSC_MAX_LOOPS;
+	int ret, oldval, tmp;
+
+	uaccess_ttbr0_enable();
+	/*
+	 * there are no ldteor/stteor instructions...
+	 */
+	asm volatile("// __lsui_futex_atomic_eor\n"
+	__LSUI_PREAMBLE
+"	prfm	pstl1strm, %2\n"
+"1:	ldtxr	%w1, %2\n"
+"	eor	%w3, %w1, %w5\n"
+"2:	stltxr	%w0, %w3, %2\n"
+"	cbz	%w0, 3f\n"
+"	sub	%w4, %w4, %w0\n"
+"	cbnz	%w4, 1b\n"
+"	mov	%w0, %w6\n"
+"3:\n"
+"	dmb	ish\n"
+	_ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w0)
+	_ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w0)
+	: "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "=&r" (tmp),
+	  "+r" (loops)
+	: "r" (oparg), "Ir" (-EAGAIN)
+	: "memory");
+	uaccess_ttbr0_disable();
+
+	if (!ret)
+		*oval = oldval;
+
+	return ret;
+}
+
+static __always_inline int
+__lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
+{
+	int ret = 0;
+	unsigned int loops = LLSC_MAX_LOOPS;
+	u32 val, tmp;
+
+	uaccess_ttbr0_enable();
+	/*
+	 * cas{al}t doesn't support word size...
+	 */
+	asm volatile("//__lsui_futex_cmpxchg\n"
+	__LSUI_PREAMBLE
+"	prfm	pstl1strm, %2\n"
+"1:	ldtxr	%w1, %2\n"
+"	eor	%w3, %w1, %w5\n"
+"	cbnz	%w3, 4f\n"
+"2:	stltxr	%w3, %w6, %2\n"
+"	cbz	%w3, 3f\n"
+"	sub	%w4, %w4, %w3\n"
+"	cbnz	%w4, 1b\n"
+"	mov	%w0, %w7\n"
+"3:\n"
+"	dmb	ish\n"
+"4:\n"
+	_ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)
+	_ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)
+	: "+r" (ret), "=&r" (val), "+Q" (*uaddr), "=&r" (tmp), "+r" (loops)
+	: "r" (oldval), "r" (newval), "Ir" (-EAGAIN)
+	: "memory");
+	uaccess_ttbr0_disable();
+
+	if (!ret)
+		*oval = oldval;
+
+	return ret;
+}
+
+#define __lsui_llsc_body(op, ...)					\
+({									\
+	alternative_has_cap_likely(ARM64_HAS_LSUI) ?			\
+		__lsui_##op(__VA_ARGS__) : __llsc_##op(__VA_ARGS__);	\
+})
+
+#else	/* CONFIG_AS_HAS_LSUI */
+
+#define __lsui_llsc_body(op, ...)	__llsc_##op(__VA_ARGS__)
+
+#endif	/* CONFIG_AS_HAS_LSUI */
+
+
 #define FUTEX_ATOMIC_OP(op)						\
 static __always_inline int						\
 __futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)		\
 {									\
-	return __llsc_futex_atomic_##op(oparg, uaddr, oval);		\
+	return __lsui_llsc_body(futex_atomic_##op, oparg, uaddr, oval);	\
 }

 FUTEX_ATOMIC_OP(add)
--
LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}



^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops
  2025-08-16 15:19 [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops Yeoreum Yun
                   ` (5 preceding siblings ...)
  2025-08-16 15:19 ` [PATCH RESEND v7 6/6] arm64: futex: support futex with FEAT_LSUI Yeoreum Yun
@ 2025-09-01 10:06 ` Yeoreum Yun
  2025-09-11 15:09 ` Will Deacon
  7 siblings, 0 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-01 10:06 UTC (permalink / raw)
  To: catalin.marinas, will, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland
  Cc: linux-arm-kernel, kvmarm, linux-kernel

Gentle ping in case this has been forgotten.

On Sat, Aug 16, 2025 at 04:19:23PM +0100, Yeoreum Yun wrote:
> Since Armv9.6, FEAT_LSUI supplies load/store instructions that allow
> privileged code to access user memory without clearing the PSTATE.PAN
> bit.
>
> This patchset adds support for FEAT_LSUI and applies it to the futex
> atomic operations, replacing the ldxr/stlxr implementation that clears
> the PSTATE.PAN bit with the corresponding unprivileged load/store atomic
> operations, which leave PSTATE.PAN set.
>
> (Sorry, I sent patch version 7 incorrectly and am resending it.
>  Apologies again for the mail bomb.)
>
> Patch Sequences
> ================
>
> Patch #1 adds the cpufeature for FEAT_LSUI
>
> Patch #2 exposes FEAT_LSUI to guests
>
> Patch #3 adds the Kconfig option for FEAT_LSUI
>
> Patch #4 refactors the existing futex atomic-op implementation based on
>          ll/sc and clearing PSTATE.PAN
>
> Patch #5 applies a small optimisation to __llsc_futex_atomic_set().
>
> Patch #6 supports the futex atomic ops with FEAT_LSUI
>
> Patch History
> ==============
> from v6 to v7:
>   - wrap FEAT_LSUI with CONFIG_AS_HAS_LSUI in cpufeature
>   - remove unnecessary addition of indentation.
>   - remove unnecessary mte_tco_enable()/disable() on LSUI operation.
>   - https://lore.kernel.org/all/20250811163635.1562145-1-yeoreum.yun@arm.com/
>
> from v5 to v6:
>   - rebase to v6.17-rc1
>   - https://lore.kernel.org/all/20250722121956.1509403-1-yeoreum.yun@arm.com/
>
> from v4 to v5:
>   - remove futex_ll_sc.h, futex_lsui.h and lsui.h and move their contents to futex.h
>   - reorganize the patches.
>   - https://lore.kernel.org/all/20250721083618.2743569-1-yeoreum.yun@arm.com/
>
> from v3 to v4:
>   - rebase to v6.16-rc7
>   - modify some patch titles.
>   - https://lore.kernel.org/all/20250617183635.1266015-1-yeoreum.yun@arm.com/
>
> from v2 to v3:
>   - expose FEAT_LSUI to guest
>   - add help section for LSUI Kconfig
>   - https://lore.kernel.org/all/20250611151154.46362-1-yeoreum.yun@arm.com/
>
> from v1 to v2:
>   - remove empty v9.6 menu entry
>   - place HAS_LSUI in cpucaps in the correct order
>   - https://lore.kernel.org/all/20250611104916.10636-1-yeoreum.yun@arm.com/
>
> Yeoreum Yun (6):
>   arm64: cpufeature: add FEAT_LSUI
>   KVM: arm64: expose FEAT_LSUI to guest
>   arm64: Kconfig: add LSUI Kconfig
>   arm64: futex: refactor futex atomic operation
>   arm64: futex: small optimisation for __llsc_futex_atomic_set()
>   arm64: futex: support futex with FEAT_LSUI
>
>  arch/arm64/Kconfig             |   5 +
>  arch/arm64/include/asm/futex.h | 291 +++++++++++++++++++++++++++------
>  arch/arm64/kernel/cpufeature.c |  10 ++
>  arch/arm64/kvm/sys_regs.c      |   5 +-
>  arch/arm64/tools/cpucaps       |   1 +
>  5 files changed, 261 insertions(+), 51 deletions(-)
>
>
> base-commit: 8f5ae30d69d7543eee0d70083daf4de8fe15d585
> --
> LEVI:{C3F47F37-75D8-414A-A8BA-3980EC8A46D7}
>

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops
  2025-08-16 15:19 [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops Yeoreum Yun
                   ` (6 preceding siblings ...)
  2025-09-01 10:06 ` [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops Yeoreum Yun
@ 2025-09-11 15:09 ` Will Deacon
  2025-09-11 16:22   ` Catalin Marinas
  7 siblings, 1 reply; 51+ messages in thread
From: Will Deacon @ 2025-09-11 15:09 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: catalin.marinas, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

On Sat, Aug 16, 2025 at 04:19:23PM +0100, Yeoreum Yun wrote:
> Since Armv9.6, FEAT_LSUI supplies load/store instructions that allow
> privileged code to access user memory without clearing the PSTATE.PAN
> bit.
>
> This patchset adds support for FEAT_LSUI and applies it to the futex
> atomic operations, replacing the ldxr/stlxr implementation that clears
> the PSTATE.PAN bit with the corresponding unprivileged load/store atomic
> operations, which leave PSTATE.PAN set.
>
> (Sorry, I sent patch version 7 incorrectly and am resending it.
>  Apologies again for the mail bomb.)

I tried to review this but I can't find any details about FEAT_LSUI in
the latest Arm ARM. Where should I be looking for the architecture spec?

Will


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 6/6] arm64: futex: support futex with FEAT_LSUI
  2025-08-16 15:19 ` [PATCH RESEND v7 6/6] arm64: futex: support futex with FEAT_LSUI Yeoreum Yun
@ 2025-09-11 15:22   ` Will Deacon
  2025-09-11 16:45     ` Yeoreum Yun
  2025-09-12 17:16     ` Catalin Marinas
  2025-09-12 17:09   ` Catalin Marinas
  1 sibling, 2 replies; 51+ messages in thread
From: Will Deacon @ 2025-09-11 15:22 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: catalin.marinas, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

On Sat, Aug 16, 2025 at 04:19:29PM +0100, Yeoreum Yun wrote:
> Current futex atomic operations are implemented with ll/sc instructions
> and by clearing PSTATE.PAN.
>
> Since Armv9.6, FEAT_LSUI supplies not only unprivileged load/store
> instructions but also atomic operations for accessing user memory from
> the kernel, so the PSTATE.PAN bit no longer needs to be cleared.
>
> With these instructions, some of the futex atomic operations no longer
> need to be implemented with an ldxr/stlxr pair; they can instead be
> implemented with a single atomic operation supplied by FEAT_LSUI.
>
> However, some futex atomic operations (i.e. eor and cmpxchg) still need
> the ll/sc approach, via the ldtxr/stltxr instructions supplied by
> FEAT_LSUI, because there is no corresponding atomic instruction or
> because cas{al}t doesn't support word-size operands.
>
> Even so, these now work without clearing the PSTATE.PAN bit.
> 
> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> ---
>  arch/arm64/include/asm/futex.h | 130 ++++++++++++++++++++++++++++++++-
>  1 file changed, 129 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> index 22a6301a9f3d..ece35ca9b5d9 100644
> --- a/arch/arm64/include/asm/futex.h
> +++ b/arch/arm64/include/asm/futex.h
> @@ -9,6 +9,8 @@
>  #include <linux/uaccess.h>
>  #include <linux/stringify.h>
> 
> +#include <asm/alternative.h>
> +#include <asm/alternative-macros.h>
>  #include <asm/errno.h>
> 
>  #define LLSC_MAX_LOOPS	128 /* What's the largest number you can think of? */
> @@ -115,11 +117,137 @@ __llsc_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
>  	return ret;
>  }
> 
> +#ifdef CONFIG_AS_HAS_LSUI
> +
> +#define __LSUI_PREAMBLE	".arch_extension lsui\n"
> +
> +#define LSUI_FUTEX_ATOMIC_OP(op, asm_op, mb)				\
> +static __always_inline int						\
> +__lsui_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
> +{									\
> +	int ret = 0;							\
> +	int oldval;							\
> +									\
> +	uaccess_ttbr0_enable();						\
> +	asm volatile("// __lsui_futex_atomic_" #op "\n"			\
> +	__LSUI_PREAMBLE							\
> +"1:	" #asm_op #mb "	%w3, %w2, %1\n"					\
> +"2:\n"									\
> +	_ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w0)				\
> +	: "+r" (ret), "+Q" (*uaddr), "=r" (oldval)			\
> +	: "r" (oparg)							\
> +	: "memory");							\
> +	uaccess_ttbr0_disable();					\
> +									\
> +	if (!ret)							\
> +		*oval = oldval;						\
> +									\
> +	return ret;							\
> +}
> +
> +LSUI_FUTEX_ATOMIC_OP(add, ldtadd, al)
> +LSUI_FUTEX_ATOMIC_OP(or, ldtset, al)
> +LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr, al)
> +LSUI_FUTEX_ATOMIC_OP(set, swpt, al)
> +
> +static __always_inline int
> +__lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
> +{
> +	return __lsui_futex_atomic_andnot(~oparg, uaddr, oval);
> +}
> +
> +static __always_inline int
> +__lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
> +{
> +	unsigned int loops = LLSC_MAX_LOOPS;
> +	int ret, oldval, tmp;
> +
> +	uaccess_ttbr0_enable();
> +	/*
> +	 * there are no ldteor/stteor instructions...
> +	 */

*sigh*

Were these new instructions not added with futex in mind?

I wonder whether CAS would be better than exclusives for xor...

> +static __always_inline int
> +__lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> +{
> +	int ret = 0;
> +	unsigned int loops = LLSC_MAX_LOOPS;
> +	u32 val, tmp;
> +
> +	uaccess_ttbr0_enable();
> +	/*
> +	 * cas{al}t doesn't support word size...
> +	 */

What about just aligning down and doing a 64-bit cas in that case?

Will


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 RESEND 5/6] arm64: futex: small optimisation for __llsc_futex_atomic_set()
  2025-08-16 15:19 ` [PATCH v7 RESEND 5/6] arm64: futex: small optimisation for __llsc_futex_atomic_set() Yeoreum Yun
@ 2025-09-11 15:28   ` Will Deacon
  2025-09-11 16:19     ` Yeoreum Yun
  0 siblings, 1 reply; 51+ messages in thread
From: Will Deacon @ 2025-09-11 15:28 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: catalin.marinas, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

On Sat, Aug 16, 2025 at 04:19:28PM +0100, Yeoreum Yun wrote:
> __llsc_futex_atomic_set() is implemented using the
> LLSC_FUTEX_ATOMIC_OP() macro with "mov  %w3, %w5".
> This instruction isn't required to implement futex_atomic_set(),
> so make a small optimisation by implementing __llsc_futex_atomic_set()
> as a separate function.
>
> This also makes the usage of the LLSC_FUTEX_ATOMIC_OP() macro simpler.
> 
> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> ---
>  arch/arm64/include/asm/futex.h | 43 ++++++++++++++++++++++++++++------
>  1 file changed, 36 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> index ab7003cb4724..22a6301a9f3d 100644
> --- a/arch/arm64/include/asm/futex.h
> +++ b/arch/arm64/include/asm/futex.h
> @@ -13,7 +13,7 @@
> 
>  #define LLSC_MAX_LOOPS	128 /* What's the largest number you can think of? */
> 
> -#define LLSC_FUTEX_ATOMIC_OP(op, insn)					\
> +#define LLSC_FUTEX_ATOMIC_OP(op, asm_op)				\
>  static __always_inline int						\
>  __llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
>  {									\
> @@ -24,7 +24,7 @@ __llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
>  	asm volatile("// __llsc_futex_atomic_" #op "\n"			\
>  "	prfm	pstl1strm, %2\n"					\
>  "1:	ldxr	%w1, %2\n"						\
> -	insn "\n"							\
> +"	" #asm_op "	%w3, %w1, %w5\n"				\
>  "2:	stlxr	%w0, %w3, %2\n"						\
>  "	cbz	%w0, 3f\n"						\
>  "	sub	%w4, %w4, %w0\n"					\
> @@ -46,11 +46,40 @@ __llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
>  	return ret;							\
>  }
> 
> -LLSC_FUTEX_ATOMIC_OP(add, "add	%w3, %w1, %w5")
> -LLSC_FUTEX_ATOMIC_OP(or, "orr	%w3, %w1, %w5")
> -LLSC_FUTEX_ATOMIC_OP(and, "and	%w3, %w1, %w5")
> -LLSC_FUTEX_ATOMIC_OP(eor, "eor	%w3, %w1, %w5")
> -LLSC_FUTEX_ATOMIC_OP(set, "mov	%w3, %w5")
> +LLSC_FUTEX_ATOMIC_OP(add, add)
> +LLSC_FUTEX_ATOMIC_OP(or, orr)
> +LLSC_FUTEX_ATOMIC_OP(and, and)
> +LLSC_FUTEX_ATOMIC_OP(eor, eor)
> +
> +static __always_inline int
> +__llsc_futex_atomic_set(int oparg, u32 __user *uaddr, int *oval)
> +{
> +	unsigned int loops = LLSC_MAX_LOOPS;
> +	int ret, oldval;
> +
> +	uaccess_enable_privileged();
> +	asm volatile("//__llsc_futex_xchg\n"
> +"	prfm	pstl1strm, %2\n"
> +"1:	ldxr	%w1, %2\n"
> +"2:	stlxr	%w0, %w4, %2\n"
> +"	cbz	%w3, 3f\n"
> +"	sub	%w3, %w3, %w0\n"
> +"	cbnz	%w3, 1b\n"
> +"	mov	%w0, %w5\n"
> +"3:\n"
> +"	dmb	ish\n"
> +	_ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w0)
> +	_ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w0)
> +	: "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "+r" (loops)
> +	: "r" (oparg), "Ir" (-EAGAIN)
> +	: "memory");
> +	uaccess_disable_privileged();
> +
> +	if (!ret)
> +		*oval = oldval;

Hmm, I'm really not sure this is worthwhile. I doubt the "optimisation"
actually does anything and adding a whole new block of asm just for the
SET case isn't much of an improvement on the maintainability side, either.

Will


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-08-16 15:19 ` [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation Yeoreum Yun
@ 2025-09-11 15:38   ` Will Deacon
  2025-09-11 16:04     ` Yeoreum Yun
  2025-09-12 16:44   ` Catalin Marinas
  2025-09-12 16:53   ` Catalin Marinas
  2 siblings, 1 reply; 51+ messages in thread
From: Will Deacon @ 2025-09-11 15:38 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: catalin.marinas, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

On Sat, Aug 16, 2025 at 04:19:27PM +0100, Yeoreum Yun wrote:
> Refactor the futex atomic operations, currently implemented with the
> ll/sc method and by clearing PSTATE.PAN, to prepare for applying
> FEAT_LSUI to them.
> 
> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> ---
>  arch/arm64/include/asm/futex.h | 132 +++++++++++++++++++++------------
>  1 file changed, 84 insertions(+), 48 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> index bc06691d2062..ab7003cb4724 100644
> --- a/arch/arm64/include/asm/futex.h
> +++ b/arch/arm64/include/asm/futex.h
> @@ -7,17 +7,21 @@
> 
>  #include <linux/futex.h>
>  #include <linux/uaccess.h>
> +#include <linux/stringify.h>
> 
>  #include <asm/errno.h>
> 
> -#define FUTEX_MAX_LOOPS	128 /* What's the largest number you can think of? */
> +#define LLSC_MAX_LOOPS	128 /* What's the largest number you can think of? */
> 
> -#define __futex_atomic_op(insn, ret, oldval, uaddr, tmp, oparg)		\
> -do {									\
> -	unsigned int loops = FUTEX_MAX_LOOPS;				\
> +#define LLSC_FUTEX_ATOMIC_OP(op, insn)					\
> +static __always_inline int						\
> +__llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
> +{									\
> +	unsigned int loops = LLSC_MAX_LOOPS;				\
> +	int ret, oldval, tmp;						\
>  									\
>  	uaccess_enable_privileged();					\
> -	asm volatile(							\
> +	asm volatile("// __llsc_futex_atomic_" #op "\n"			\
>  "	prfm	pstl1strm, %2\n"					\
>  "1:	ldxr	%w1, %2\n"						\
>  	insn "\n"							\
> @@ -35,45 +39,103 @@ do {									\
>  	: "r" (oparg), "Ir" (-EAGAIN)					\
>  	: "memory");							\
>  	uaccess_disable_privileged();					\
> -} while (0)
> +									\
> +	if (!ret)							\
> +		*oval = oldval;						\

Why push the store to '*oval' down into here?

Will


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-11 15:38   ` Will Deacon
@ 2025-09-11 16:04     ` Yeoreum Yun
  0 siblings, 0 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-11 16:04 UTC (permalink / raw)
  To: Will Deacon
  Cc: catalin.marinas, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

Hi Will,

[...]

> > -#define __futex_atomic_op(insn, ret, oldval, uaddr, tmp, oparg)		\
> > -do {									\
> > -	unsigned int loops = FUTEX_MAX_LOOPS;				\
> > +#define LLSC_FUTEX_ATOMIC_OP(op, insn)					\
> > +static __always_inline int						\
> > +__llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
> > +{									\
> > +	unsigned int loops = LLSC_MAX_LOOPS;				\
> > +	int ret, oldval, tmp;						\
> >  									\
> >  	uaccess_enable_privileged();					\
> > -	asm volatile(							\
> > +	asm volatile("// __llsc_futex_atomic_" #op "\n"			\
> >  "	prfm	pstl1strm, %2\n"					\
> >  "1:	ldxr	%w1, %2\n"						\
> >  	insn "\n"							\
> > @@ -35,45 +39,103 @@ do {									\
> >  	: "r" (oparg), "Ir" (-EAGAIN)					\
> >  	: "memory");							\
> >  	uaccess_disable_privileged();					\
> > -} while (0)
> > +									\
> > +	if (!ret)							\
> > +		*oval = oldval;						\
>
> Why push the store to '*oval' down into here?

Since __llsc_futex_atomic_##op() is declared as an inline function,
I think it is better for readability to pass oval through from
arch_futex_atomic_op_inuser() as it is.

Is it awful?

Thanks.

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 RESEND 5/6] arm64: futex: small optimisation for __llsc_futex_atomic_set()
  2025-09-11 15:28   ` Will Deacon
@ 2025-09-11 16:19     ` Yeoreum Yun
  2025-09-12 16:36       ` Catalin Marinas
  0 siblings, 1 reply; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-11 16:19 UTC (permalink / raw)
  To: Will Deacon
  Cc: catalin.marinas, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

Hi Will,

[...]

> >  arch/arm64/include/asm/futex.h | 43 ++++++++++++++++++++++++++++------
> >  1 file changed, 36 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> > index ab7003cb4724..22a6301a9f3d 100644
> > --- a/arch/arm64/include/asm/futex.h
> > +++ b/arch/arm64/include/asm/futex.h
> > @@ -13,7 +13,7 @@
> >
> >  #define LLSC_MAX_LOOPS	128 /* What's the largest number you can think of? */
> >
> > -#define LLSC_FUTEX_ATOMIC_OP(op, insn)					\
> > +#define LLSC_FUTEX_ATOMIC_OP(op, asm_op)				\
> >  static __always_inline int						\
> >  __llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
> >  {									\
> > @@ -24,7 +24,7 @@ __llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
> >  	asm volatile("// __llsc_futex_atomic_" #op "\n"			\
> >  "	prfm	pstl1strm, %2\n"					\
> >  "1:	ldxr	%w1, %2\n"						\
> > -	insn "\n"							\
> > +"	" #asm_op "	%w3, %w1, %w5\n"				\
> >  "2:	stlxr	%w0, %w3, %2\n"						\
> >  "	cbz	%w0, 3f\n"						\
> >  "	sub	%w4, %w4, %w0\n"					\
> > @@ -46,11 +46,40 @@ __llsc_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
> >  	return ret;							\
> >  }
> >
> > -LLSC_FUTEX_ATOMIC_OP(add, "add	%w3, %w1, %w5")
> > -LLSC_FUTEX_ATOMIC_OP(or, "orr	%w3, %w1, %w5")
> > -LLSC_FUTEX_ATOMIC_OP(and, "and	%w3, %w1, %w5")
> > -LLSC_FUTEX_ATOMIC_OP(eor, "eor	%w3, %w1, %w5")
> > -LLSC_FUTEX_ATOMIC_OP(set, "mov	%w3, %w5")
> > +LLSC_FUTEX_ATOMIC_OP(add, add)
> > +LLSC_FUTEX_ATOMIC_OP(or, orr)
> > +LLSC_FUTEX_ATOMIC_OP(and, and)
> > +LLSC_FUTEX_ATOMIC_OP(eor, eor)
> > +
> > +static __always_inline int
> > +__llsc_futex_atomic_set(int oparg, u32 __user *uaddr, int *oval)
> > +{
> > +	unsigned int loops = LLSC_MAX_LOOPS;
> > +	int ret, oldval;
> > +
> > +	uaccess_enable_privileged();
> > +	asm volatile("//__llsc_futex_xchg\n"
> > +"	prfm	pstl1strm, %2\n"
> > +"1:	ldxr	%w1, %2\n"
> > +"2:	stlxr	%w0, %w4, %2\n"
> > +"	cbz	%w3, 3f\n"
> > +"	sub	%w3, %w3, %w0\n"
> > +"	cbnz	%w3, 1b\n"
> > +"	mov	%w0, %w5\n"
> > +"3:\n"
> > +"	dmb	ish\n"
> > +	_ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w0)
> > +	_ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w0)
> > +	: "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "+r" (loops)
> > +	: "r" (oparg), "Ir" (-EAGAIN)
> > +	: "memory");
> > +	uaccess_disable_privileged();
> > +
> > +	if (!ret)
> > +		*oval = oldval;
>
> Hmm, I'm really not sure this is worthwhile. I doubt the "optimisation"
> actually does anything and adding a whole new block of asm just for the
> SET case isn't much of an improvement on the maintainability side, either.

TBH, I had the same question, but since this code path is used
frequently, I decided even a small optimisation -- removing just one
instruction -- was worthwhile.

But I don't have a strong opinion on this patch.
If it's not good from a maintainability perspective,
it can be dropped.

Thanks!

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops
  2025-09-11 15:09 ` Will Deacon
@ 2025-09-11 16:22   ` Catalin Marinas
  2025-09-15 20:37     ` Will Deacon
  0 siblings, 1 reply; 51+ messages in thread
From: Catalin Marinas @ 2025-09-11 16:22 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yeoreum Yun, broonie, maz, oliver.upton, joey.gouly, james.morse,
	ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

On Thu, Sep 11, 2025 at 04:09:42PM +0100, Will Deacon wrote:
> On Sat, Aug 16, 2025 at 04:19:23PM +0100, Yeoreum Yun wrote:
> > Since Armv9.6, FEAT_LSUI supplies load/store instructions that allow
> > privileged code to access user memory without clearing the PSTATE.PAN
> > bit.
> >
> > This patchset adds support for FEAT_LSUI and applies it to the futex
> > atomic operations, replacing the ldxr/stlxr implementation that clears
> > the PSTATE.PAN bit with the corresponding unprivileged load/store atomic
> > operations, which leave PSTATE.PAN set.
> >
> > (Sorry, I sent patch version 7 incorrectly and am resending it.
> >  Apologies again for the mail bomb.)
> 
> I tried to review this but I can't find any details about FEAT_LSUI in
> the latest Arm ARM. Where should I be looking for the architecture spec?

Unfortunately, it's just in the public xml at the moment. Hopefully
we'll get a release of the Arm ARM by the end of the year. Otherwise, in
the private 2024 arch spec. Not ideal though.

If you'd rather wait until it turns up in the public spec, fine by me.

-- 
Catalin


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 6/6] arm64: futex: support futex with FEAT_LSUI
  2025-09-11 15:22   ` Will Deacon
@ 2025-09-11 16:45     ` Yeoreum Yun
  2025-09-12 17:16     ` Catalin Marinas
  1 sibling, 0 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-11 16:45 UTC (permalink / raw)
  To: Will Deacon
  Cc: catalin.marinas, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

Hi Will,

> > Current futex atomic operations are implemented with ll/sc instructions
> > and by clearing PSTATE.PAN.
> >
> > Since Armv9.6, FEAT_LSUI supplies not only unprivileged load/store
> > instructions but also atomic operations for accessing user memory from
> > the kernel, so the PSTATE.PAN bit no longer needs to be cleared.
> >
> > With these instructions, some of the futex atomic operations no longer
> > need to be implemented with an ldxr/stlxr pair; they can instead be
> > implemented with a single atomic operation supplied by FEAT_LSUI.
> >
> > However, some futex atomic operations (i.e. eor and cmpxchg) still need
> > the ll/sc approach, via the ldtxr/stltxr instructions supplied by
> > FEAT_LSUI, because there is no corresponding atomic instruction or
> > because cas{al}t doesn't support word-size operands.
> >
> > Even so, these now work without clearing the PSTATE.PAN bit.
> >
> > Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> > ---
> >  arch/arm64/include/asm/futex.h | 130 ++++++++++++++++++++++++++++++++-
> >  1 file changed, 129 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> > index 22a6301a9f3d..ece35ca9b5d9 100644
> > --- a/arch/arm64/include/asm/futex.h
> > +++ b/arch/arm64/include/asm/futex.h
> > @@ -9,6 +9,8 @@
> >  #include <linux/uaccess.h>
> >  #include <linux/stringify.h>
> >
> > +#include <asm/alternative.h>
> > +#include <asm/alternative-macros.h>
> >  #include <asm/errno.h>
> >
> >  #define LLSC_MAX_LOOPS	128 /* What's the largest number you can think of? */
> > @@ -115,11 +117,137 @@ __llsc_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> >  	return ret;
> >  }
> >
> > +#ifdef CONFIG_AS_HAS_LSUI
> > +
> > +#define __LSUI_PREAMBLE	".arch_extension lsui\n"
> > +
> > +#define LSUI_FUTEX_ATOMIC_OP(op, asm_op, mb)				\
> > +static __always_inline int						\
> > +__lsui_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
> > +{									\
> > +	int ret = 0;							\
> > +	int oldval;							\
> > +									\
> > +	uaccess_ttbr0_enable();						\
> > +	asm volatile("// __lsui_futex_atomic_" #op "\n"			\
> > +	__LSUI_PREAMBLE							\
> > +"1:	" #asm_op #mb "	%w3, %w2, %1\n"					\
> > +"2:\n"									\
> > +	_ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w0)				\
> > +	: "+r" (ret), "+Q" (*uaddr), "=r" (oldval)			\
> > +	: "r" (oparg)							\
> > +	: "memory");							\
> > +	uaccess_ttbr0_disable();					\
> > +									\
> > +	if (!ret)							\
> > +		*oval = oldval;						\
> > +									\
> > +	return ret;							\
> > +}
> > +
> > +LSUI_FUTEX_ATOMIC_OP(add, ldtadd, al)
> > +LSUI_FUTEX_ATOMIC_OP(or, ldtset, al)
> > +LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr, al)
> > +LSUI_FUTEX_ATOMIC_OP(set, swpt, al)
> > +
> > +static __always_inline int
> > +__lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
> > +{
> > +	return __lsui_futex_atomic_andnot(~oparg, uaddr, oval);
> > +}
> > +
> > +static __always_inline int
> > +__lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
> > +{
> > +	unsigned int loops = LLSC_MAX_LOOPS;
> > +	int ret, oldval, tmp;
> > +
> > +	uaccess_ttbr0_enable();
> > +	/*
> > +	 * there are no ldteor/stteor instructions...
> > +	 */
>
> *sigh*
>
> Were these new instructions not added with futex in mind?

Rather than futex, these seem to be designed for the atomic_op()s
(like a user-memory version of LSE)...

That's why there seems to be no "eor" variant for this...

> I wonder whether CAS would be better than exclusives for xor...
>
> > +static __always_inline int
> > +__lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> > +{
> > +	int ret = 0;
> > +	unsigned int loops = LLSC_MAX_LOOPS;
> > +	u32 val, tmp;
> > +
> > +	uaccess_ttbr0_enable();
> > +	/*
> > +	 * cas{al}t doesn't support word size...
> > +	 */
>
> What about just aligning down and doing a 64-bit cas in that case?

Even if cas{al}t were applied to futex_eor() and futex_cmpxchg(), I
think we would still need to check that the old value is unchanged since
the load, which means the routine would end up looking much like the
ll/sc one:

again:
   oldval = *uaddr;
   oldval2 = oldval;
   cas(uaddr, oldval2, newval);
   if (oldval != oldval2)
     goto again;

With CAS, trying the cmpxchg once and returning -EAGAIN immediately when
the old value differs would not be the same behaviour as the former
__llsc_futex_atomic_op().

This patch's intention is not to change the former behaviour, only to
remove the PSTATE change.

If this behaviour change is allowed, I'll gladly replace them with a
CAS-based version :)
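
For illustration, a bounded-retry sketch of Will's align-down idea
(32-bit cmpxchg via a 64-bit CAS, keeping the -EAGAIN bound) might look
roughly like the code below; __lsui_cas64_user() is a hypothetical
wrapper around a user-memory casalt that returns -EFAULT on a fault and
updates *old to the observed value -- it is not something in this series:

static int lsui_futex_cmpxchg_cas64_sketch(u32 __user *uaddr, u32 oldval,
                                           u32 newval, u32 *oval)
{
        u64 __user *uaddr64 = (u64 __user *)ALIGN_DOWN((unsigned long)uaddr, 8);
        unsigned int shift = ((unsigned long)uaddr & 4) ? 32 : 0; /* little-endian */
        unsigned int loops = LLSC_MAX_LOOPS;
        u64 cur, old64, new64;
        int ret;

        while (loops--) {
                /* read the containing doubleword, adjacent half included */
                if (get_user(cur, uaddr64))
                        return -EFAULT;

                if ((u32)(cur >> shift) != oldval) {
                        *oval = (u32)(cur >> shift);    /* caller sees the mismatch */
                        return 0;
                }

                old64 = cur;
                new64 = (cur & ~(0xffffffffULL << shift)) | ((u64)newval << shift);

                ret = __lsui_cas64_user(uaddr64, &old64, new64); /* hypothetical */
                if (ret)
                        return ret;

                if (old64 == cur) {                     /* CAS succeeded */
                        *oval = oldval;
                        return 0;
                }

                if ((u32)(old64 >> shift) != oldval) {
                        *oval = (u32)(old64 >> shift);  /* our word changed */
                        return 0;
                }
                /* only the adjacent half changed -- retry */
        }

        return -EAGAIN;
}

So it still ends up with a retry loop, just CAS-based instead of
ldtxr/stltxr-based.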

Thanks!

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 1/6] arm64: cpufeature: add FEAT_LSUI
  2025-08-16 15:19 ` [PATCH RESEND v7 1/6] arm64: cpufeature: add FEAT_LSUI Yeoreum Yun
@ 2025-09-12 16:12   ` Catalin Marinas
  0 siblings, 0 replies; 51+ messages in thread
From: Catalin Marinas @ 2025-09-12 16:12 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

On Sat, Aug 16, 2025 at 04:19:24PM +0100, Yeoreum Yun wrote:
> Since Armv9.6, FEAT_LSUI supplies load/store instructions that allow
> privileged code to access user memory without clearing the PSTATE.PAN bit.
>
> Add the LSUI CPU feature so that these unprivileged load/store
> instructions can be used when the kernel accesses user memory, without
> clearing the PSTATE.PAN bit.

These two paragraphs are pretty much saying the same thing. Just replace
the second with something like "Add CPU feature detection for
FEAT_LSUI". No need to explain the PSTATE.PAN again.

> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 3/6] arm64: Kconfig: add LSUI Kconfig
  2025-08-16 15:19 ` [PATCH RESEND v7 3/6] arm64: Kconfig: add LSUI Kconfig Yeoreum Yun
@ 2025-09-12 16:24   ` Catalin Marinas
  2025-09-15 10:42     ` Yeoreum Yun
  0 siblings, 1 reply; 51+ messages in thread
From: Catalin Marinas @ 2025-09-12 16:24 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

On Sat, Aug 16, 2025 at 04:19:26PM +0100, Yeoreum Yun wrote:
> Since Armv9.6, FEAT_LSUI supplies load/store instructions that allow
> privileged code to access user memory without clearing the PSTATE.PAN
> bit.
> Adding CONFIG_AS_HAS_LSUI is sufficient because the LSUI code enables the
> extension with individual `.arch_extension` directives.

The subject could be improved slightly: Detect toolchain support for
LSUI.

> 
> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> ---
>  arch/arm64/Kconfig | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index e9bbfacc35a6..c474de3dce02 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2239,6 +2239,11 @@ config ARM64_GCS
> 
>  endmenu # "v9.4 architectural features"
> 
> +config AS_HAS_LSUI
> +	def_bool $(as-instr,.arch_extension lsui)
> +	help
> +	 Supported by LLVM 20 and later, not yet supported by GNU AS.

binutils 2.45 added support for LSUI.

Nitpick: we tend to add a two-space indentation from "help".

Otherwise it looks fine:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 2/6] KVM: arm64: expose FEAT_LSUI to guest
  2025-08-16 15:19 ` [PATCH RESEND v7 2/6] KVM: arm64: expose FEAT_LSUI to guest Yeoreum Yun
@ 2025-09-12 16:25   ` Catalin Marinas
  0 siblings, 0 replies; 51+ messages in thread
From: Catalin Marinas @ 2025-09-12 16:25 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

On Sat, Aug 16, 2025 at 04:19:25PM +0100, Yeoreum Yun wrote:
> expose FEAT_LSUI to guest.
> 
> Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> Acked-by: Marc Zyngier <maz@kernel.org>

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 RESEND 5/6] arm64: futex: small optimisation for __llsc_futex_atomic_set()
  2025-09-11 16:19     ` Yeoreum Yun
@ 2025-09-12 16:36       ` Catalin Marinas
  2025-09-15 10:41         ` Yeoreum Yun
  0 siblings, 1 reply; 51+ messages in thread
From: Catalin Marinas @ 2025-09-12 16:36 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: Will Deacon, broonie, maz, oliver.upton, joey.gouly, james.morse,
	ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

On Thu, Sep 11, 2025 at 05:19:11PM +0100, Yeoreum Yun wrote:
> > > +static __always_inline int
> > > +__llsc_futex_atomic_set(int oparg, u32 __user *uaddr, int *oval)
[...]
> > Hmm, I'm really not sure this is worthwhile. I doubt the "optimisation"
> > actually does anything and adding a whole new block of asm just for the
> > SET case isn't much of an improvement on the maintainability side, either.
> 
> TBH, I had the same question, but I thought this code seems to modify
> freqenetly, I decide even a small optimisation -- reduce one instruction
> only.
> 
> But I don't have strong opinion for this patch.
> If it's not good for maintainability perspective,
> This patch can be dropped.

I'd drop it for now unless you can show some performance benefits
(unlikely).

-- 
Catalin


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-08-16 15:19 ` [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation Yeoreum Yun
  2025-09-11 15:38   ` Will Deacon
@ 2025-09-12 16:44   ` Catalin Marinas
  2025-09-12 17:01     ` Catalin Marinas
  2025-09-15 10:39     ` Yeoreum Yun
  2025-09-12 16:53   ` Catalin Marinas
  2 siblings, 2 replies; 51+ messages in thread
From: Catalin Marinas @ 2025-09-12 16:44 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

On Sat, Aug 16, 2025 at 04:19:27PM +0100, Yeoreum Yun wrote:
> +#define FUTEX_ATOMIC_OP(op)						\
> +static __always_inline int						\
> +__futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)		\
> +{									\
> +	return __llsc_futex_atomic_##op(oparg, uaddr, oval);		\
> +}
> +
> +FUTEX_ATOMIC_OP(add)
> +FUTEX_ATOMIC_OP(or)
> +FUTEX_ATOMIC_OP(and)
> +FUTEX_ATOMIC_OP(eor)
> +FUTEX_ATOMIC_OP(set)
> +
> +static __always_inline int
> +__futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> +{
> +	return __llsc_futex_cmpxchg(uaddr, oldval, newval, oval);
> +}

The patch looks fine, and my impression was that the __futex_* functions
would be used in patch 6 to dispatch between lsui and llsc. But you got
them the other way around. I'll comment there. For this patch:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-08-16 15:19 ` [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation Yeoreum Yun
  2025-09-11 15:38   ` Will Deacon
  2025-09-12 16:44   ` Catalin Marinas
@ 2025-09-12 16:53   ` Catalin Marinas
  2025-09-15 10:32     ` Yeoreum Yun
  2 siblings, 1 reply; 51+ messages in thread
From: Catalin Marinas @ 2025-09-12 16:53 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

On Sat, Aug 16, 2025 at 04:19:27PM +0100, Yeoreum Yun wrote:
> diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> index bc06691d2062..ab7003cb4724 100644
> --- a/arch/arm64/include/asm/futex.h
> +++ b/arch/arm64/include/asm/futex.h
> @@ -7,17 +7,21 @@
> 
>  #include <linux/futex.h>
>  #include <linux/uaccess.h>
> +#include <linux/stringify.h>
> 
>  #include <asm/errno.h>
> 
> -#define FUTEX_MAX_LOOPS	128 /* What's the largest number you can think of? */
> +#define LLSC_MAX_LOOPS	128 /* What's the largest number you can think of? */

I just noticed - you might as well leave the name as is here, especially
if in patch 6 you align down the address and use CAS on a 64-bit value as
per Will's comment (and it's no longer LLSC). I think renaming this is
unnecessary.

-- 
Catalin


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-12 16:44   ` Catalin Marinas
@ 2025-09-12 17:01     ` Catalin Marinas
  2025-09-15 10:39     ` Yeoreum Yun
  1 sibling, 0 replies; 51+ messages in thread
From: Catalin Marinas @ 2025-09-12 17:01 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

On Fri, Sep 12, 2025 at 05:44:15PM +0100, Catalin Marinas wrote:
> On Sat, Aug 16, 2025 at 04:19:27PM +0100, Yeoreum Yun wrote:
> > +#define FUTEX_ATOMIC_OP(op)						\
> > +static __always_inline int						\
> > +__futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)		\
> > +{									\
> > +	return __llsc_futex_atomic_##op(oparg, uaddr, oval);		\
> > +}
> > +
> > +FUTEX_ATOMIC_OP(add)
> > +FUTEX_ATOMIC_OP(or)
> > +FUTEX_ATOMIC_OP(and)
> > +FUTEX_ATOMIC_OP(eor)
> > +FUTEX_ATOMIC_OP(set)
> > +
> > +static __always_inline int
> > +__futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> > +{
> > +	return __llsc_futex_cmpxchg(uaddr, oldval, newval, oval);
> > +}
> 
> The patch looks fine and my impression that the __futex_* functions will
> be used in patch 6 to dispatch between lsui and llsc. But you got them
> the other way around. I'll comment there.

Ah, ignore me, I misread patch 6. It looks good.

> For this patch:
> 
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 6/6] arm64: futex: support futex with FEAT_LSUI
  2025-08-16 15:19 ` [PATCH RESEND v7 6/6] arm64: futex: support futex with FEAT_LSUI Yeoreum Yun
  2025-09-11 15:22   ` Will Deacon
@ 2025-09-12 17:09   ` Catalin Marinas
  2025-09-15  8:24     ` Yeoreum Yun
  1 sibling, 1 reply; 51+ messages in thread
From: Catalin Marinas @ 2025-09-12 17:09 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

On Sat, Aug 16, 2025 at 04:19:29PM +0100, Yeoreum Yun wrote:
> @@ -115,11 +117,137 @@ __llsc_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
>  	return ret;
>  }
> 
> +#ifdef CONFIG_AS_HAS_LSUI
> +
> +#define __LSUI_PREAMBLE	".arch_extension lsui\n"
> +
> +#define LSUI_FUTEX_ATOMIC_OP(op, asm_op, mb)				\
> +static __always_inline int						\
> +__lsui_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
> +{									\
> +	int ret = 0;							\
> +	int oldval;							\
> +									\
> +	uaccess_ttbr0_enable();						\

I think we can drop uaccess_ttbr0_*() from these functions. At the
kconfig level, TTBR0_PAN selects PAN. Hardware with LSUI will also
have PAN (since 8.1), so the above is an unnecessary branch or nop,
depending on how the alternatives play out. But add a comment instead.
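
Something like this as the comment (just a suggestion for the wording):

	/*
	 * No uaccess_ttbr0_enable()/disable() needed here: FEAT_LSUI
	 * implies FEAT_PAN (Armv8.1), so on LSUI-capable hardware these
	 * calls would only ever be an unnecessary branch or nop.
	 */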

> +static __always_inline int
> +__lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
> +{
> +	unsigned int loops = LLSC_MAX_LOOPS;
> +	int ret, oldval, tmp;
> +
> +	uaccess_ttbr0_enable();
> +	/*
> +	 * there are no ldteor/stteor instructions...
> +	 */
> +	asm volatile("// __lsui_futex_atomic_eor\n"
> +	__LSUI_PREAMBLE
> +"	prfm	pstl1strm, %2\n"
> +"1:	ldtxr	%w1, %2\n"
> +"	eor	%w3, %w1, %w5\n"
> +"2:	stltxr	%w0, %w3, %2\n"
> +"	cbz	%w0, 3f\n"
> +"	sub	%w4, %w4, %w0\n"
> +"	cbnz	%w4, 1b\n"
> +"	mov	%w0, %w6\n"
> +"3:\n"
> +"	dmb	ish\n"
> +	_ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w0)
> +	_ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w0)
> +	: "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "=&r" (tmp),
> +	  "+r" (loops)
> +	: "r" (oparg), "Ir" (-EAGAIN)
> +	: "memory");
> +	uaccess_ttbr0_disable();
> +
> +	if (!ret)
> +		*oval = oldval;
> +
> +	return ret;
> +}

That's an unfortunate omission from the architecture.

> +#define __lsui_llsc_body(op, ...)					\
> +({									\
> +	alternative_has_cap_likely(ARM64_HAS_LSUI) ?			\
> +		__lsui_##op(__VA_ARGS__) : __llsc_##op(__VA_ARGS__);	\
> +})
> +
> +#else	/* CONFIG_AS_HAS_LSUI */
> +
> +#define __lsui_llsc_body(op, ...)	__llsc_##op(__VA_ARGS__)
> +
> +#endif	/* CONFIG_AS_HAS_LSUI */
> +
> +
>  #define FUTEX_ATOMIC_OP(op)						\
>  static __always_inline int						\
>  __futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)		\
>  {									\
> -	return __llsc_futex_atomic_##op(oparg, uaddr, oval);		\
> +	return __lsui_llsc_body(futex_atomic_##op, oparg, uaddr, oval);	\
>  }

That's what I got confused about. It looks fine:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 6/6] arm64: futex: support futex with FEAT_LSUI
  2025-09-11 15:22   ` Will Deacon
  2025-09-11 16:45     ` Yeoreum Yun
@ 2025-09-12 17:16     ` Catalin Marinas
  2025-09-15  9:15       ` Yeoreum Yun
  1 sibling, 1 reply; 51+ messages in thread
From: Catalin Marinas @ 2025-09-12 17:16 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yeoreum Yun, broonie, maz, oliver.upton, joey.gouly, james.morse,
	ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

On Thu, Sep 11, 2025 at 04:22:24PM +0100, Will Deacon wrote:
> On Sat, Aug 16, 2025 at 04:19:29PM +0100, Yeoreum Yun wrote:
> > +static __always_inline int
> > +__lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
> > +{
> > +	unsigned int loops = LLSC_MAX_LOOPS;
> > +	int ret, oldval, tmp;
> > +
> > +	uaccess_ttbr0_enable();
> > +	/*
> > +	 * there are no ldteor/stteor instructions...
> > +	 */
> 
> *sigh*
> 
> Were these new instructions not added with futex in mind?

I guess it was _most_ of the futex.

> I wonder whether CAS would be better than exclusives for xor...

I was first thinking we could share some of the code with
__futex_cmpxchg() but...

> > +static __always_inline int
> > +__lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> > +{
> > +	int ret = 0;
> > +	unsigned int loops = LLSC_MAX_LOOPS;
> > +	u32 val, tmp;
> > +
> > +	uaccess_ttbr0_enable();
> > +	/*
> > +	 * cas{al}t doesn't support word size...
> > +	 */
> 
> What about just aligning down and doing a 64-bit cas in that case?

I think it gets more complicated. Here we get the oldval from the
caller, so no need to do a read. With CAS, we'd need to read the full
64-bit, replace half of it with oldval and newval just to be able to do
the operation. On top of this, we need to check which half of the 64-bit
value. I think it's too hairy for little benefit.

-- 
Catalin


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 6/6] arm64: futex: support futex with FEAT_LSUI
  2025-09-12 17:09   ` Catalin Marinas
@ 2025-09-15  8:24     ` Yeoreum Yun
  0 siblings, 0 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-15  8:24 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

Hi Catalin,

> On Sat, Aug 16, 2025 at 04:19:29PM +0100, Yeoreum Yun wrote:
> > @@ -115,11 +117,137 @@ __llsc_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> >  	return ret;
> >  }
> >
> > +#ifdef CONFIG_AS_HAS_LSUI
> > +
> > +#define __LSUI_PREAMBLE	".arch_extension lsui\n"
> > +
> > +#define LSUI_FUTEX_ATOMIC_OP(op, asm_op, mb)				\
> > +static __always_inline int						\
> > +__lsui_futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)	\
> > +{									\
> > +	int ret = 0;							\
> > +	int oldval;							\
> > +									\
> > +	uaccess_ttbr0_enable();						\
>
> I think we can drop uaccess_ttbr0_*() from these functions. At the
> kconfig level, TTBR0_PAN selects PAN. Hardware with LSUI will also
> have PAN (since 8.1), so the above is an unnecessary branch or nop,
> depending on how the alternatives play out. But add a comment instead.

Thanks for pointing this out.
I'll change it.

>
> > +static __always_inline int
> > +__lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
> > +{
> > +	unsigned int loops = LLSC_MAX_LOOPS;
> > +	int ret, oldval, tmp;
> > +
> > +	uaccess_ttbr0_enable();
> > +	/*
> > +	 * there are no ldteor/stteor instructions...
> > +	 */
> > +	asm volatile("// __lsui_futex_atomic_eor\n"
> > +	__LSUI_PREAMBLE
> > +"	prfm	pstl1strm, %2\n"
> > +"1:	ldtxr	%w1, %2\n"
> > +"	eor	%w3, %w1, %w5\n"
> > +"2:	stltxr	%w0, %w3, %2\n"
> > +"	cbz	%w0, 3f\n"
> > +"	sub	%w4, %w4, %w0\n"
> > +"	cbnz	%w4, 1b\n"
> > +"	mov	%w0, %w6\n"
> > +"3:\n"
> > +"	dmb	ish\n"
> > +	_ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w0)
> > +	_ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w0)
> > +	: "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "=&r" (tmp),
> > +	  "+r" (loops)
> > +	: "r" (oparg), "Ir" (-EAGAIN)
> > +	: "memory");
> > +	uaccess_ttbr0_disable();
> > +
> > +	if (!ret)
> > +		*oval = oldval;
> > +
> > +	return ret;
> > +}
>
> That's an unfortunate omission from the architecture.
>
> > +#define __lsui_llsc_body(op, ...)					\
> > +({									\
> > +	alternative_has_cap_likely(ARM64_HAS_LSUI) ?			\
> > +		__lsui_##op(__VA_ARGS__) : __llsc_##op(__VA_ARGS__);	\
> > +})
> > +
> > +#else	/* CONFIG_AS_HAS_LSUI */
> > +
> > +#define __lsui_llsc_body(op, ...)	__llsc_##op(__VA_ARGS__)
> > +
> > +#endif	/* CONFIG_AS_HAS_LSUI */
> > +
> > +
> >  #define FUTEX_ATOMIC_OP(op)						\
> >  static __always_inline int						\
> >  __futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)		\
> >  {									\
> > -	return __llsc_futex_atomic_##op(oparg, uaddr, oval);		\
> > +	return __lsui_llsc_body(futex_atomic_##op, oparg, uaddr, oval);	\
> >  }
>
> That's what I got confused about. It looks fine:
>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Thanks!
--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 6/6] arm64: futex: support futex with FEAT_LSUI
  2025-09-12 17:16     ` Catalin Marinas
@ 2025-09-15  9:15       ` Yeoreum Yun
  0 siblings, 0 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-15  9:15 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, broonie, maz, oliver.upton, joey.gouly, james.morse,
	ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

Hi,

> On Thu, Sep 11, 2025 at 04:22:24PM +0100, Will Deacon wrote:
> > On Sat, Aug 16, 2025 at 04:19:29PM +0100, Yeoreum Yun wrote:
> > > +static __always_inline int
> > > +__lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
> > > +{
> > > +	unsigned int loops = LLSC_MAX_LOOPS;
> > > +	int ret, oldval, tmp;
> > > +
> > > +	uaccess_ttbr0_enable();
> > > +	/*
> > > +	 * there are no ldteor/stteor instructions...
> > > +	 */
> >
> > *sigh*
> >
> > Were these new instructions not added with futex in mind?
>
> I guess it was _most_ of the futex.
>
> > I wonder whether CAS would be better than exclusives for xor...
>
> I was first thinking we could share some of the code with
> __futex_cmpxchg() but...
>
> > > +static __always_inline int
> > > +__lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> > > +{
> > > +	int ret = 0;
> > > +	unsigned int loops = LLSC_MAX_LOOPS;
> > > +	u32 val, tmp;
> > > +
> > > +	uaccess_ttbr0_enable();
> > > +	/*
> > > +	 * cas{al}t doesn't support word size...
> > > +	 */
> >
> > What about just aligning down and doing a 64-bit cas in that case?
>
> I think it gets more complicated. Here we get the oldval from the
> caller, so no need to do a read. With CAS, we'd need to read the full
> 64-bit, replace half of it with oldval and newval just to be able to do
> the operation. On top of this, we need to check which half of the 64-bit
> value. I think it to hairy for little benefit.

Agreed. Also, an unrelated change to the other 32 bits could cause the
futex atomic operation to fail.

So I'll keep the LL/SC method for cmpxchg and eor, even when using LSUI.

Thanks!
--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-12 16:53   ` Catalin Marinas
@ 2025-09-15 10:32     ` Yeoreum Yun
  2025-09-15 19:40       ` Catalin Marinas
  0 siblings, 1 reply; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-15 10:32 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

Hi Catalin,

> On Sat, Aug 16, 2025 at 04:19:27PM +0100, Yeoreum Yun wrote:
> > diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> > index bc06691d2062..ab7003cb4724 100644
> > --- a/arch/arm64/include/asm/futex.h
> > +++ b/arch/arm64/include/asm/futex.h
> > @@ -7,17 +7,21 @@
> >
> >  #include <linux/futex.h>
> >  #include <linux/uaccess.h>
> > +#include <linux/stringify.h>
> >
> >  #include <asm/errno.h>
> >
> > -#define FUTEX_MAX_LOOPS	128 /* What's the largest number you can think of? */
> > +#define LLSC_MAX_LOOPS	128 /* What's the largest number you can think of? */
>
> I just noticed - you might as well leave the name as is here, especially
> if in patch 6 you align down address and use CAS on a 64-bit value as
> per Will's comment (and it's no longer LLSC). I think renaming this is
> unnecessary.

Okay. I'll restore the original name.
But I think LSUI wouldn't be used with CAS, according to your comments on
patch 6, and additionally I think using CAS could cause a spurious failure
when an unrelated field changes, i.e.:

struct user_structure {
  uint32 futex;
  uint32 some_value;
};

In this case, a change of some_value from the user side could make the
futex atomic operation fail.

So I think it would be better to keep the current LL/SC implementation
for LSUI.

Thanks.

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-12 16:44   ` Catalin Marinas
  2025-09-12 17:01     ` Catalin Marinas
@ 2025-09-15 10:39     ` Yeoreum Yun
  1 sibling, 0 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-15 10:39 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

> On Sat, Aug 16, 2025 at 04:19:27PM +0100, Yeoreum Yun wrote:
> > +#define FUTEX_ATOMIC_OP(op)						\
> > +static __always_inline int						\
> > +__futex_atomic_##op(int oparg, u32 __user *uaddr, int *oval)		\
> > +{									\
> > +	return __llsc_futex_atomic_##op(oparg, uaddr, oval);		\
> > +}
> > +
> > +FUTEX_ATOMIC_OP(add)
> > +FUTEX_ATOMIC_OP(or)
> > +FUTEX_ATOMIC_OP(and)
> > +FUTEX_ATOMIC_OP(eor)
> > +FUTEX_ATOMIC_OP(set)
> > +
> > +static __always_inline int
> > +__futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> > +{
> > +	return __llsc_futex_cmpxchg(uaddr, oldval, newval, oval);
> > +}
>
> The patch looks fine and my impression that the __futex_* functions will
> be used in patch 6 to dispatch between lsui and llsc. But you got them
> the other way around. I'll comment there. For this patch:
>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Thanks!

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH v7 RESEND 5/6] arm64: futex: small optimisation for __llsc_futex_atomic_set()
  2025-09-12 16:36       ` Catalin Marinas
@ 2025-09-15 10:41         ` Yeoreum Yun
  0 siblings, 0 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-15 10:41 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, broonie, maz, oliver.upton, joey.gouly, james.morse,
	ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

On Fri, Sep 12, 2025 at 05:36:20PM +0100, Catalin Marinas wrote:
> On Thu, Sep 11, 2025 at 05:19:11PM +0100, Yeoreum Yun wrote:
> > > > +static __always_inline int
> > > > +__llsc_futex_atomic_set(int oparg, u32 __user *uaddr, int *oval)
> [...]
> > > Hmm, I'm really not sure this is worthwhile. I doubt the "optimisation"
> > > actually does anything and adding a whole new block of asm just for the
> > > SET case isn't much of an improvement on the maintainability side, either.
> >
> > TBH, I had the same question, but I thought this code seems to modify
> > freqenetly, I decide even a small optimisation -- reduce one instruction
> > only.
> >
> > But I don't have strong opinion for this patch.
> > If it's not good for maintainability perspective,
> > This patch can be dropped.
>
> I'd drop it for now unless you can show some performance benefits
> (unlikely).

Yes, not much of an improvement. So I'll drop this patch.
Thanks.

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 3/6] arm64: Kconfig: add LSUI Kconfig
  2025-09-12 16:24   ` Catalin Marinas
@ 2025-09-15 10:42     ` Yeoreum Yun
  2025-09-15 11:32       ` Will Deacon
  0 siblings, 1 reply; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-15 10:42 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

Hi Catalin,

> On Sat, Aug 16, 2025 at 04:19:26PM +0100, Yeoreum Yun wrote:
> > Since Armv9.6, FEAT_LSUI supplies the load/store instructions for
> > previleged level to access to access user memory without clearing
> > PSTATE.PAN bit.
> > It's enough to add CONFIG_AS_HAS_LSUI only because the code for LSUI uses
> > individual `.arch_extension` entries.
>
> The subject could be improved slightly: Detect toolchain support for
> LSUI.
>
> >
> > Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> > ---
> >  arch/arm64/Kconfig | 5 +++++
> >  1 file changed, 5 insertions(+)
> >
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index e9bbfacc35a6..c474de3dce02 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2239,6 +2239,11 @@ config ARM64_GCS
> >
> >  endmenu # "v9.4 architectural features"
> >
> > +config AS_HAS_LSUI
> > +	def_bool $(as-instr,.arch_extension lsui)
> > +	help
> > +	 Supported by LLVM 20 and later, not yet supported by GNU AS.
>
> binutils 2.45 added support for LSUI.
>
> Nitpick: we tend to add a two-space indentation from "help".
>
> Otherwise it looks fine:
>
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Thanks! I'll fix the nitpick and send it again.

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 3/6] arm64: Kconfig: add LSUI Kconfig
  2025-09-15 10:42     ` Yeoreum Yun
@ 2025-09-15 11:32       ` Will Deacon
  2025-09-15 11:41         ` Yeoreum Yun
  0 siblings, 1 reply; 51+ messages in thread
From: Will Deacon @ 2025-09-15 11:32 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: Catalin Marinas, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

On Mon, Sep 15, 2025 at 11:42:12AM +0100, Yeoreum Yun wrote:
> > On Sat, Aug 16, 2025 at 04:19:26PM +0100, Yeoreum Yun wrote:
> > > Since Armv9.6, FEAT_LSUI supplies the load/store instructions for
> > > previleged level to access to access user memory without clearing
> > > PSTATE.PAN bit.
> > > It's enough to add CONFIG_AS_HAS_LSUI only because the code for LSUI uses
> > > individual `.arch_extension` entries.
> >
> > The subject could be improved slightly: Detect toolchain support for
> > LSUI.
> >
> > >
> > > Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> > > ---
> > >  arch/arm64/Kconfig | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > >
> > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > index e9bbfacc35a6..c474de3dce02 100644
> > > --- a/arch/arm64/Kconfig
> > > +++ b/arch/arm64/Kconfig
> > > @@ -2239,6 +2239,11 @@ config ARM64_GCS
> > >
> > >  endmenu # "v9.4 architectural features"
> > >
> > > +config AS_HAS_LSUI
> > > +	def_bool $(as-instr,.arch_extension lsui)
> > > +	help
> > > +	 Supported by LLVM 20 and later, not yet supported by GNU AS.
> >
> > binutils 2.45 added support for LSUI.
> >
> > Nitpick: we tend to add a two-space indentation from "help".
> >
> > Otherwise it looks fine:
> >
> > Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> 
> Thanks! I'll change the Nitpick and send again.

Please don't resend until we've concluded on the CAS discussion.

Will


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 3/6] arm64: Kconfig: add LSUI Kconfig
  2025-09-15 11:32       ` Will Deacon
@ 2025-09-15 11:41         ` Yeoreum Yun
  0 siblings, 0 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-15 11:41 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

Hi Will,

> On Mon, Sep 15, 2025 at 11:42:12AM +0100, Yeoreum Yun wrote:
> > > On Sat, Aug 16, 2025 at 04:19:26PM +0100, Yeoreum Yun wrote:
> > > > Since Armv9.6, FEAT_LSUI supplies the load/store instructions for
> > > > previleged level to access to access user memory without clearing
> > > > PSTATE.PAN bit.
> > > > It's enough to add CONFIG_AS_HAS_LSUI only because the code for LSUI uses
> > > > individual `.arch_extension` entries.
> > >
> > > The subject could be improved slightly: Detect toolchain support for
> > > LSUI.
> > >
> > > >
> > > > Signed-off-by: Yeoreum Yun <yeoreum.yun@arm.com>
> > > > ---
> > > >  arch/arm64/Kconfig | 5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > >
> > > > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > > > index e9bbfacc35a6..c474de3dce02 100644
> > > > --- a/arch/arm64/Kconfig
> > > > +++ b/arch/arm64/Kconfig
> > > > @@ -2239,6 +2239,11 @@ config ARM64_GCS
> > > >
> > > >  endmenu # "v9.4 architectural features"
> > > >
> > > > +config AS_HAS_LSUI
> > > > +	def_bool $(as-instr,.arch_extension lsui)
> > > > +	help
> > > > +	 Supported by LLVM 20 and later, not yet supported by GNU AS.
> > >
> > > binutils 2.45 added support for LSUI.
> > >
> > > Nitpick: we tend to add a two-space indentation from "help".
> > >
> > > Otherwise it looks fine:
> > >
> > > Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> >
> > Thanks! I'll change the Nitpick and send again.
>
> Please don't resend until we've concluded on the CAS discussion.

Yes, so I'm waiting for your comments on:
  - https://lore.kernel.org/all/aMfZI2VVV5zEsTna@e129823.arm.com/
  - https://lore.kernel.org/all/aMfrR0vserl%2FhfZ3@e129823.arm.com/

Thanks.

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-15 10:32     ` Yeoreum Yun
@ 2025-09-15 19:40       ` Catalin Marinas
  2025-09-15 20:35         ` Will Deacon
  2025-09-15 22:34         ` Yeoreum Yun
  0 siblings, 2 replies; 51+ messages in thread
From: Catalin Marinas @ 2025-09-15 19:40 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

On Mon, Sep 15, 2025 at 11:32:39AM +0100, Yeoreum Yun wrote:
> > On Sat, Aug 16, 2025 at 04:19:27PM +0100, Yeoreum Yun wrote:
> > > diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> > > index bc06691d2062..ab7003cb4724 100644
> > > --- a/arch/arm64/include/asm/futex.h
> > > +++ b/arch/arm64/include/asm/futex.h
> > > @@ -7,17 +7,21 @@
> > >
> > >  #include <linux/futex.h>
> > >  #include <linux/uaccess.h>
> > > +#include <linux/stringify.h>
> > >
> > >  #include <asm/errno.h>
> > >
> > > -#define FUTEX_MAX_LOOPS	128 /* What's the largest number you can think of? */
> > > +#define LLSC_MAX_LOOPS	128 /* What's the largest number you can think of? */
> >
> > I just noticed - you might as well leave the name as is here, especially
> > if in patch 6 you align down address and use CAS on a 64-bit value as
> > per Will's comment (and it's no longer LLSC). I think renaming this is
> > unnecessary.
> 
> Okay. I'll restore to use origin name.
> But I think LSUI wouldn't be used with CAS according to patch 6's
> comments from you and additionally i think
> chaning the CAS would make a failure because of
> change of unrelated field. i.e)
> 
> struct user_structure{
>   uint32 futex;
>   uint32 some_value;
> };
> 
> In this case, the change of some_value from user side could make a
> failure of futex atomic operation.

Yes but the loop would read 'some_value' again, fold in 'futex' and
retry. It would eventually succeed or fail after 128 iterations if the
user keeps changing that location. Note that's also the case with LL/SC,
the exclusive monitor would be cleared by some store in the same cache
line (well, depending on the hardware implementation) and the STXR would
fail. From this perspective, CAS has a better chance of succeeding.
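
Roughly this shape, just to illustrate the retry semantics (a userspace
sketch with the compiler builtins, names made up; the real thing would of
course go through cas{al}t and the uaccess machinery):

#include <errno.h>
#include <stdint.h>

#define FUTEX_MAX_LOOPS	128

/*
 * 'word' is the naturally aligned 64-bit location containing the futex;
 * 'shift' (0 or 32) selects which half the 32-bit futex occupies.
 */
static int cmpxchg32_via_cas64(uint64_t *word, unsigned int shift,
			       uint32_t oldval, uint32_t newval,
			       uint32_t *oval)
{
	uint64_t mask = (uint64_t)0xffffffff << shift;

	for (int i = 0; i < FUTEX_MAX_LOOPS; i++) {
		/* re-read the neighbouring 32 bits on every attempt */
		uint64_t cur = __atomic_load_n(word, __ATOMIC_RELAXED);
		uint64_t exp = (cur & ~mask) | ((uint64_t)oldval << shift);
		uint64_t val = (cur & ~mask) | ((uint64_t)newval << shift);

		if (__atomic_compare_exchange_n(word, &exp, val, 0,
						__ATOMIC_ACQ_REL,
						__ATOMIC_RELAXED)) {
			*oval = oldval;			/* swap done */
			return 0;
		}
		if ((uint32_t)(exp >> shift) != oldval) {
			*oval = (uint32_t)(exp >> shift); /* genuine mismatch */
			return 0;
		}
		/* only the other half changed under us: fold it in, retry */
	}
	return -EAGAIN;
}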

> So I think it would be better to keep the current LLSC implementation
> in LSUI.

I think the code would look simpler with LL/SC but you can give it a try
and post the code sample here (not in a new series).

BTW, is there a test suite for all the futex operations? The cover
letter did not mention any.

-- 
Catalin


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-15 19:40       ` Catalin Marinas
@ 2025-09-15 20:35         ` Will Deacon
  2025-09-16  7:02           ` Catalin Marinas
  2025-09-15 22:34         ` Yeoreum Yun
  1 sibling, 1 reply; 51+ messages in thread
From: Will Deacon @ 2025-09-15 20:35 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Yeoreum Yun, broonie, maz, oliver.upton, joey.gouly, james.morse,
	ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

On Mon, Sep 15, 2025 at 08:40:33PM +0100, Catalin Marinas wrote:
> On Mon, Sep 15, 2025 at 11:32:39AM +0100, Yeoreum Yun wrote:
> > > On Sat, Aug 16, 2025 at 04:19:27PM +0100, Yeoreum Yun wrote:
> > > > diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> > > > index bc06691d2062..ab7003cb4724 100644
> > > > --- a/arch/arm64/include/asm/futex.h
> > > > +++ b/arch/arm64/include/asm/futex.h
> > > > @@ -7,17 +7,21 @@
> > > >
> > > >  #include <linux/futex.h>
> > > >  #include <linux/uaccess.h>
> > > > +#include <linux/stringify.h>
> > > >
> > > >  #include <asm/errno.h>
> > > >
> > > > -#define FUTEX_MAX_LOOPS	128 /* What's the largest number you can think of? */
> > > > +#define LLSC_MAX_LOOPS	128 /* What's the largest number you can think of? */
> > >
> > > I just noticed - you might as well leave the name as is here, especially
> > > if in patch 6 you align down address and use CAS on a 64-bit value as
> > > per Will's comment (and it's no longer LLSC). I think renaming this is
> > > unnecessary.
> > 
> > Okay. I'll restore to use origin name.
> > But I think LSUI wouldn't be used with CAS according to patch 6's
> > comments from you and additionally i think
> > chaning the CAS would make a failure because of
> > change of unrelated field. i.e)
> > 
> > struct user_structure{
> >   uint32 futex;
> >   uint32 some_value;
> > };
> > 
> > In this case, the change of some_value from user side could make a
> > failure of futex atomic operation.
> 
> Yes but the loop would read 'some_value' again, fold in 'futex' and
> retry. It would eventually succeed or fail after 128 iterations if the
> user keeps changing that location. Note that's also the case with LL/SC,
> the exclusive monitor would be cleared by some store in the same cache
> line (well, depending on the hardware implementation) and the STXR fail.
> From this perspective, CAS has better chance of succeeding.
> 
> > So I think it would be better to keep the current LLSC implementation
> > in LSUI.
> 
> I think the code would look simpler with LL/SC but you can give it a try
> and post the code sample here (not in a new series).

If you stick the cas*t instruction in its own helper say, cmpxchg_user(),
then you can do all the shifting/masking in C and I don't reckon it's
that bad. It means we (a) get rid of exclusives, which is the whole
point of this and (b) don't have to mess around with PAN.
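
i.e. keep the asm down to something with this sort of contract (name and
exact signature made up here, just to show the split; all the masking then
lives in plain C in the caller):

/*
 * Hypothetical helper: the only asm, wrapping cas{al}t plus the
 * exception table fixup. Compares the user word at 'uaddr' against
 * *old, stores 'new' if it matches, and always writes the value it
 * observed back into *old; returns 0, or -EFAULT on a fault.
 */
int cmpxchg_user(u64 __user *uaddr, u64 *old, u64 new);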

> BTW, is there a test suite for all the futex operations? The cover
> letter did not mention any.

I was thinking that too. I'm sure I remember a 'futextest' kicking
around when we did the arm64 port but nowadays there's something in
tools/testing/selftests/futex/ which might be better.

Will


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops
  2025-09-11 16:22   ` Catalin Marinas
@ 2025-09-15 20:37     ` Will Deacon
  0 siblings, 0 replies; 51+ messages in thread
From: Will Deacon @ 2025-09-15 20:37 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Yeoreum Yun, broonie, maz, oliver.upton, joey.gouly, james.morse,
	ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

On Thu, Sep 11, 2025 at 05:22:22PM +0100, Catalin Marinas wrote:
> On Thu, Sep 11, 2025 at 04:09:42PM +0100, Will Deacon wrote:
> > On Sat, Aug 16, 2025 at 04:19:23PM +0100, Yeoreum Yun wrote:
> > > Since Armv9.6, FEAT_LSUI supplies the load/store instructions for
> > > previleged level to access to access user memory without clearing
> > > PSTATE.PAN bit.
> > > 
> > > This patchset support FEAT_LSUI and applies in futex atomic operation
> > > where can replace from ldxr/stlxr pair implmentation with clearing
> > > PSTATE.PAN bit to correspondant load/store unprevileged atomic operation
> > > without clearing PSTATE.PAN bit.
> > > 
> > > (Sorry, I've sent wrongly for patch version 7 and resend it.
> > >  Again, sorry for mail-boom).
> > 
> > I tried to review this but I can't find any details about FEAT_LSUI in
> > the latest Arm ARM. Where should I be looking for the architecture spec?
> 
> Unfortunately, it's just in the public xml at the moment. Hopefully
> we'll get a release of the Arm ARM by the end of the year. Otherwise, in
> the private 2024 arch spec. Not ideal though.

Ah, thanks, the xml is a *lot* better than nothing.

> If you'd rather wait until in turns up in the public spec, fine by me.

As long as you're happy to continue helping with the review, I think
it's fine.

Cheers,

Will


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-15 19:40       ` Catalin Marinas
  2025-09-15 20:35         ` Will Deacon
@ 2025-09-15 22:34         ` Yeoreum Yun
  2025-09-16 12:53           ` Catalin Marinas
  1 sibling, 1 reply; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-15 22:34 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

Hi Catalin,

> > > On Sat, Aug 16, 2025 at 04:19:27PM +0100, Yeoreum Yun wrote:
> > > > diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> > > > index bc06691d2062..ab7003cb4724 100644
> > > > --- a/arch/arm64/include/asm/futex.h
> > > > +++ b/arch/arm64/include/asm/futex.h
> > > > @@ -7,17 +7,21 @@
> > > >
> > > >  #include <linux/futex.h>
> > > >  #include <linux/uaccess.h>
> > > > +#include <linux/stringify.h>
> > > >
> > > >  #include <asm/errno.h>
> > > >
> > > > -#define FUTEX_MAX_LOOPS	128 /* What's the largest number you can think of? */
> > > > +#define LLSC_MAX_LOOPS	128 /* What's the largest number you can think of? */
> > >
> > > I just noticed - you might as well leave the name as is here, especially
> > > if in patch 6 you align down address and use CAS on a 64-bit value as
> > > per Will's comment (and it's no longer LLSC). I think renaming this is
> > > unnecessary.
> >
> > Okay. I'll restore to use origin name.
> > But I think LSUI wouldn't be used with CAS according to patch 6's
> > comments from you and additionally i think
> > chaning the CAS would make a failure because of
> > change of unrelated field. i.e)
> >
> > struct user_structure{
> >   uint32 futex;
> >   uint32 some_value;
> > };
> >
> > In this case, the change of some_value from user side could make a
> > failure of futex atomic operation.
>
> Yes but the loop would read 'some_value' again, fold in 'futex' and
> retry. It would eventually succeed or fail after 128 iterations if the
> user keeps changing that location. Note that's also the case with LL/SC,
> the exclusive monitor would be cleared by some store in the same cache
> line (well, depending on the hardware implementation) and the STXR fail.
> From this perspective, CAS has better chance of succeeding.

Oh, I see. Thanks for the insight ;)

>
> > So I think it would be better to keep the current LLSC implementation
> > in LSUI.
>
> I think the code would look simpler with LL/SC but you can give it a try
> and post the code sample here (not in a new series).

Okay. I'll try.

>
> BTW, is there a test suite for all the futex operations? The cover
> letter did not mention any.

I've tested with the futex selftests.
But since there is no test for each operation, even in LTP,
I tested with some additional tests of my own.

>
> --
> Catalin

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-15 20:35         ` Will Deacon
@ 2025-09-16  7:02           ` Catalin Marinas
  2025-09-16  9:15             ` Yeoreum Yun
  2025-09-16 10:02             ` Yeoreum Yun
  0 siblings, 2 replies; 51+ messages in thread
From: Catalin Marinas @ 2025-09-16  7:02 UTC (permalink / raw)
  To: Will Deacon
  Cc: Yeoreum Yun, broonie, maz, oliver.upton, joey.gouly, james.morse,
	ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

On Mon, Sep 15, 2025 at 09:35:55PM +0100, Will Deacon wrote:
> On Mon, Sep 15, 2025 at 08:40:33PM +0100, Catalin Marinas wrote:
> > On Mon, Sep 15, 2025 at 11:32:39AM +0100, Yeoreum Yun wrote:
> > > So I think it would be better to keep the current LLSC implementation
> > > in LSUI.
> > 
> > I think the code would look simpler with LL/SC but you can give it a try
> > and post the code sample here (not in a new series).
> 
> If you stick the cas*t instruction in its own helper say, cmpxchg_user(),
> then you can do all the shifting/masking in C and I don't reckon it's
> that bad. It means we (a) get rid of exclusives, which is the whole
> point of this and (b) don't have to mess around with PAN.

We get rid of PAN toggling already since FEAT_LSUI introduces
LDTXR/STTXR. But, I'm all for CAS if it doesn't look too bad. Easier
I think if we do a get_user() of a u64 and combine it with the futex u32
while taking care of CPU endianness. All in a loop. Hopefully the
compiler is smart enough to reduce masking/or'ing to fewer instructions.
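
Roughly the shape I have in mind, sitting next to the existing helpers in
futex.h (untested sketch, modulo the usual unsafe_get_user()/
user_access_begin() plumbing; cmpxchg_user() is the hypothetical cas{al}t
wrapper Will mentioned, returning the observed 64-bit value through its
second argument):

static int futex_cmpxchg_via_cas64(u32 __user *uaddr, u32 oldval,
				   u32 newval, u32 *oval)
{
	u64 __user *uaddr64;
	unsigned int loops = FUTEX_MAX_LOOPS;
	unsigned int shift;
	u64 mask, cur, exp, obs;
	int ret;

	uaddr64 = (u64 __user *)ALIGN_DOWN((unsigned long)uaddr, sizeof(u64));
	/* which half of the 64-bit word the futex lives in */
	shift = ((unsigned long)uaddr & sizeof(u32)) ? 32 : 0;
	if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
		shift ^= 32;
	mask = GENMASK_ULL(shift + 31, shift);

	do {
		if (get_user(cur, uaddr64))
			return -EFAULT;

		exp = (cur & ~mask) | ((u64)oldval << shift);
		obs = exp;
		ret = cmpxchg_user(uaddr64, &obs,
				   (cur & ~mask) | ((u64)newval << shift));
		if (ret)
			return ret;

		if (obs == exp) {			/* swap done */
			*oval = oldval;
			return 0;
		}
		if ((u32)(obs >> shift) != oldval) {	/* genuine mismatch */
			*oval = (u32)(obs >> shift);
			return 0;
		}
		/* only the neighbouring 32 bits changed under us: retry */
	} while (--loops);

	return -EAGAIN;
}

The eor case would then just read the old value, compute old ^ oparg and
go through the same loop.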

-- 
Catalin


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-16  7:02           ` Catalin Marinas
@ 2025-09-16  9:15             ` Yeoreum Yun
  2025-09-16  9:24               ` Yeoreum Yun
  2025-09-16 10:02             ` Yeoreum Yun
  1 sibling, 1 reply; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-16  9:15 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, broonie, maz, oliver.upton, joey.gouly, james.morse,
	ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

Hi,

> On Mon, Sep 15, 2025 at 09:35:55PM +0100, Will Deacon wrote:
> > On Mon, Sep 15, 2025 at 08:40:33PM +0100, Catalin Marinas wrote:
> > > On Mon, Sep 15, 2025 at 11:32:39AM +0100, Yeoreum Yun wrote:
> > > > So I think it would be better to keep the current LLSC implementation
> > > > in LSUI.
> > >
> > > I think the code would look simpler with LL/SC but you can give it a try
> > > and post the code sample here (not in a new series).
> >
> > If you stick the cas*t instruction in its own helper say, cmpxchg_user(),
> > then you can do all the shifting/masking in C and I don't reckon it's
> > that bad. It means we (a) get rid of exclusives, which is the whole
> > point of this and (b) don't have to mess around with PAN.
>
> We get rid of PAN toggling already since FEAT_LSUI introduces
> LDTXR/STTXR. But, I'm all for CAS if it doesn't look too bad. Easier
> I think if we do a get_user() of a u64 and combine it with the futex u32
> while taking care of CPU endianness. All in a loop. Hopefully the
> compiler is smart enough to reduce masking/or'ing to fewer instructions.
>

Hmm, I think the shifting/masking can be replaced by a single bfi
instruction, like:

diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
index 1d6d9f856ac5..30da0006c0c8 100644
--- a/arch/arm64/include/asm/futex.h
+++ b/arch/arm64/include/asm/futex.h
@@ -126,6 +126,59 @@ LSUI_FUTEX_ATOMIC_OP(or, ldtset, al)
 LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr, al)
 LSUI_FUTEX_ATOMIC_OP(set, swpt, al)

+
+#define LSUI_CMPXCHG_HELPER(suffix, start_bit)                                 \
+static __always_inline int                                                     \
+__lsui_cmpxchg_helper_##suffix(u64 __user *uaddr, u32 oldval, u32 newval)      \
+{                                                                              \
+       int ret = 0;                                                            \
+       u64 oval, nval, tmp;                                                    \
+                                                                               \
+       asm volatile("//__lsui_cmpxchg_helper_" #suffix "\n"                    \
+       __LSUI_PREAMBLE                                                         \
+"      prfm    pstl1strm, %2\n"                                                \
+"1:    ldtr    %x1, %2\n"                                                      \
+"      bfi     %x1, %x5, #" #start_bit ", #32\n"                               \
+"      bfi     %x1, %x6, #" #start_bit ", #32\n"                               \
+"      mov     %x4, %x5\n"                                                     \
+"2:    caslt   %x5, %x6, %2\n"                                                 \
+"      sub     %x4, %x4, %x5\n"                                                \
+"      cbz     %x4, 3f\n"                                                      \
+"      mov     %w0, %w7\n"                                                     \
+"3:\n"                                                                         \
+"      dmb     ish\n"                                                          \
+"4:\n"                                                                         \
+       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)                                   \
+       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)                                   \
+       : "+r" (ret), "=&r" (oval), "+Q" (*uaddr), "=&r" (nval), "=&r" (tmp)    \
+       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)                            \
+       : "memory");                                                            \
+                                                                               \
+       return ret;                                                             \
+}
+
+LSUI_CMPXCHG_HELPER(lo, 0)
+LSUI_CMPXCHG_HELPER(hi, 32)
+
+static __always_inline int
+__lsui_cmpxchg_helper(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
+{
+       int ret;
+       unsigned long uaddr_al;
+
+       uaddr_al = ALIGN_DOWN((unsigned long)uaddr, sizeof(u64));
+
+       if (uaddr_al != (unsigned long)uaddr)
+               ret = __lsui_cmpxchg_helper_hi((u64 __user *)uaddr_al, oldval, newval);
+       else
+               ret = __lsui_cmpxchg_helper_lo((u64 __user *)uaddr_al, oldval, newval);
+
+       if (!ret)
+               *oval = oldval;
+
+       return ret;
+}
+
 static __always_inline int
 __lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
 {
@@ -135,71 +188,25 @@ __lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
 static __always_inline int
 __lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
 {
-       unsigned int loops = FUTEX_MAX_LOOPS;
-       int ret, oldval, tmp;
+       int ret = -EAGAIN;
+       u32 oldval, newval;

        /*
         * there are no ldteor/stteor instructions...
         */
-       asm volatile("// __lsui_futex_atomic_eor\n"
-       __LSUI_PREAMBLE
-"      prfm    pstl1strm, %2\n"
-"1:    ldtxr   %w1, %2\n"
-"      eor     %w3, %w1, %w5\n"
-"2:    stltxr  %w0, %w3, %2\n"
-"      cbz     %w0, 3f\n"
-"      sub     %w4, %w4, %w0\n"
-"      cbnz    %w4, 1b\n"
-"      mov     %w0, %w6\n"
-"3:\n"
-"      dmb     ish\n"
-       _ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w0)
-       _ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w0)
-       : "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "=&r" (tmp),
-         "+r" (loops)
-       : "r" (oparg), "Ir" (-EAGAIN)
-       : "memory");
+       unsafe_get_user(oldval, uaddr, err_fault);
+       newval = oldval ^ oparg;

-       if (!ret)
-               *oval = oldval;
+       ret = __lsui_cmpxchg_helper(uaddr, oldval, newval, oval);

+err_fault:
        return ret;
 }

 static __always_inline int
 __lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
 {
-       int ret = 0;
-       unsigned int loops = FUTEX_MAX_LOOPS;
-       u32 val, tmp;
-
-       /*
-        * cas{al}t doesn't support word size...
-        */
-       asm volatile("//__lsui_futex_cmpxchg\n"
-       __LSUI_PREAMBLE
-"      prfm    pstl1strm, %2\n"
-"1:    ldtxr   %w1, %2\n"
-"      eor     %w3, %w1, %w5\n"
-"      cbnz    %w3, 4f\n"
-"2:    stltxr  %w3, %w6, %2\n"
-"      cbz     %w3, 3f\n"
-"      sub     %w4, %w4, %w3\n"
-"      cbnz    %w4, 1b\n"
-"      mov     %w0, %w7\n"
-"3:\n"
-"      dmb     ish\n"
-"4:\n"
-       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)
-       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)
-       : "+r" (ret), "=&r" (val), "+Q" (*uaddr), "=&r" (tmp), "+r" (loops)
-       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)
-       : "memory");
-
-       if (!ret)
-               *oval = oldval;
-
-       return ret;
+       return __lsui_cmpxchg_helper(uaddr, oldval, newval, oval);
 }

 #define __lsui_llsc_body(op, ...)                                      \


This is based on patch #6.

Am I missing something?

--
Sincerely,
Yeoreum Yun


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-16  9:15             ` Yeoreum Yun
@ 2025-09-16  9:24               ` Yeoreum Yun
  0 siblings, 0 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-16  9:24 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, broonie, maz, oliver.upton, joey.gouly, james.morse,
	ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

Sorry, ignore this. I've sent the wrong version :(
I'll send it again.

> Hi,
>
> > On Mon, Sep 15, 2025 at 09:35:55PM +0100, Will Deacon wrote:
> > > On Mon, Sep 15, 2025 at 08:40:33PM +0100, Catalin Marinas wrote:
> > > > On Mon, Sep 15, 2025 at 11:32:39AM +0100, Yeoreum Yun wrote:
> > > > > So I think it would be better to keep the current LLSC implementation
> > > > > in LSUI.
> > > >
> > > > I think the code would look simpler with LL/SC but you can give it a try
> > > > and post the code sample here (not in a new series).
> > >
> > > If you stick the cas*t instruction in its own helper say, cmpxchg_user(),
> > > then you can do all the shifting/masking in C and I don't reckon it's
> > > that bad. It means we (a) get rid of exclusives, which is the whole
> > > point of this and (b) don't have to mess around with PAN.
> >
> > We get rid of PAN toggling already since FEAT_LSUI introduces
> > LDTXR/STTXR. But, I'm all for CAS if it doesn't look too bad. Easier
> > I think if we do a get_user() of a u64 and combine it with the futex u32
> > while taking care of CPU endianness. All in a loop. Hopefully the
> > compiler is smart enough to reduce masking/or'ing to fewer instructions.
> >
>
> Hmm, I think sure shifting/masking can be replace by single bfi
> instruction like:
>
> diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> index 1d6d9f856ac5..30da0006c0c8 100644
> --- a/arch/arm64/include/asm/futex.h
> +++ b/arch/arm64/include/asm/futex.h
> @@ -126,6 +126,59 @@ LSUI_FUTEX_ATOMIC_OP(or, ldtset, al)
>  LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr, al)
>  LSUI_FUTEX_ATOMIC_OP(set, swpt, al)
>
> +
> +#define LSUI_CMPXCHG_HELPER(suffix, start_bit)                                 \
> +static __always_inline int                                                     \
> +__lsui_cmpxchg_helper_##suffix(u64 __user *uaddr, u32 oldval, u32 newval)      \
> +{                                                                              \
> +       int ret = 0;                                                            \
> +       u64 oval, nval, tmp;                                                    \
> +                                                                               \
> +       asm volatile("//__lsui_cmpxchg_helper_" #suffix "\n"                    \
> +       __LSUI_PREAMBLE                                                         \
> +"      prfm    pstl1strm, %2\n"                                                \
> +"1:    ldtr    %x1, %2\n"                                                      \
> +"      bfi     %x1, %x5, #" #start_bit ", #32\n"                               \
> +"      bfi     %x1, %x6, #" #start_bit ", #32\n"                               \
> +"      mov     %x4, %x5\n"                                                     \
> +"2:    caslt   %x5, %x6, %2\n"                                                 \
> +"      sub     %x4, %x4, %x5\n"                                                \
> +"      cbz     %x4, 3f\n"                                                      \
> +"      mov     %w0, %w7\n"                                                     \
> +"3:\n"                                                                         \
> +"      dmb     ish\n"                                                          \
> +"4:\n"                                                                         \
> +       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)                                   \
> +       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)                                   \
> +       : "+r" (ret), "=&r" (oval), "+Q" (*uaddr), "=&r" (nval), "=&r" (tmp)    \
> +       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)                            \
> +       : "memory");                                                            \
> +                                                                               \
> +       return ret;                                                             \
> +}
> +
> +LSUI_CMPXCHG_HELPER(lo, 0)
> +LSUI_CMPXCHG_HELPER(hi, 32)
> +
> +static __always_inline int
> +__lsui_cmpxchg_helper(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> +{
> +       int ret;
> +       unsigned long uaddr_al;
> +
> +       uaddr_al = ALIGN_DOWN((unsigned long)uaddr, sizeof(u64));
> +
> +       if (uaddr_al != (unsigned long)uaddr)
> +               ret = __lsui_cmpxchg_helper_hi((u64 __user *)uaddr_al, oldval, newval);
> +       else
> +               ret = __lsui_cmpxchg_helper_lo((u64 __user *)uaddr_al, oldval, newval);
> +
> +       if (!ret)
> +               *oval = oldval;
> +
> +       return ret;
> +}
> +
>  static __always_inline int
>  __lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
>  {
> @@ -135,71 +188,25 @@ __lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
>  static __always_inline int
>  __lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
>  {
> -       unsigned int loops = FUTEX_MAX_LOOPS;
> -       int ret, oldval, tmp;
> +       int ret = -EAGAIN;
> +       u32 oldval, newval;
>
>         /*
>          * there are no ldteor/stteor instructions...
>          */
> -       asm volatile("// __lsui_futex_atomic_eor\n"
> -       __LSUI_PREAMBLE
> -"      prfm    pstl1strm, %2\n"
> -"1:    ldtxr   %w1, %2\n"
> -"      eor     %w3, %w1, %w5\n"
> -"2:    stltxr  %w0, %w3, %2\n"
> -"      cbz     %w0, 3f\n"
> -"      sub     %w4, %w4, %w0\n"
> -"      cbnz    %w4, 1b\n"
> -"      mov     %w0, %w6\n"
> -"3:\n"
> -"      dmb     ish\n"
> -       _ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w0)
> -       _ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w0)
> -       : "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "=&r" (tmp),
> -         "+r" (loops)
> -       : "r" (oparg), "Ir" (-EAGAIN)
> -       : "memory");
> +       unsafe_get_user(oldval, uaddr, err_fault);
> +       newval = oldval ^ oparg;
>
> -       if (!ret)
> -               *oval = oldval;
> +       ret = __lsui_cmpxchg_helper(uaddr, oldval, newval, oval);
>
> +err_fault:
>         return ret;
>  }
>
>  static __always_inline int
>  __lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
>  {
> -       int ret = 0;
> -       unsigned int loops = FUTEX_MAX_LOOPS;
> -       u32 val, tmp;
> -
> -       /*
> -        * cas{al}t doesn't support word size...
> -        */
> -       asm volatile("//__lsui_futex_cmpxchg\n"
> -       __LSUI_PREAMBLE
> -"      prfm    pstl1strm, %2\n"
> -"1:    ldtxr   %w1, %2\n"
> -"      eor     %w3, %w1, %w5\n"
> -"      cbnz    %w3, 4f\n"
> -"2:    stltxr  %w3, %w6, %2\n"
> -"      cbz     %w3, 3f\n"
> -"      sub     %w4, %w4, %w3\n"
> -"      cbnz    %w4, 1b\n"
> -"      mov     %w0, %w7\n"
> -"3:\n"
> -"      dmb     ish\n"
> -"4:\n"
> -       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)
> -       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)
> -       : "+r" (ret), "=&r" (val), "+Q" (*uaddr), "=&r" (tmp), "+r" (loops)
> -       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)
> -       : "memory");
> -
> -       if (!ret)
> -               *oval = oldval;
> -
> -       return ret;
> +       return __lsui_cmpxchg_helper(uaddr, oldval, newval, oval);
>  }
>
>  #define __lsui_llsc_body(op, ...)                                      \
>
>
> This is based on the patch #6.
>
> Am I missing something?
>
> --
> Sincerely,
> Yeoreum Yun
>

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-16  7:02           ` Catalin Marinas
  2025-09-16  9:15             ` Yeoreum Yun
@ 2025-09-16 10:02             ` Yeoreum Yun
  2025-09-16 10:16               ` Will Deacon
  2025-09-16 12:47               ` Mark Rutland
  1 sibling, 2 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-16 10:02 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, broonie, maz, oliver.upton, joey.gouly, james.morse,
	ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

Hi,
> On Mon, Sep 15, 2025 at 09:35:55PM +0100, Will Deacon wrote:
> > On Mon, Sep 15, 2025 at 08:40:33PM +0100, Catalin Marinas wrote:
> > > On Mon, Sep 15, 2025 at 11:32:39AM +0100, Yeoreum Yun wrote:
> > > > So I think it would be better to keep the current LLSC implementation
> > > > in LSUI.
> > >
> > > I think the code would look simpler with LL/SC but you can give it a try
> > > and post the code sample here (not in a new series).
> >
> > If you stick the cas*t instruction in its own helper say, cmpxchg_user(),
> > then you can do all the shifting/masking in C and I don't reckon it's
> > that bad. It means we (a) get rid of exclusives, which is the whole
> > point of this and (b) don't have to mess around with PAN.
>
> We get rid of PAN toggling already since FEAT_LSUI introduces
> LDTXR/STTXR. But, I'm all for CAS if it doesn't look too bad. Easier
> I think if we do a get_user() of a u64 and combine it with the futex u32
> while taking care of CPU endianness. All in a loop. Hopefully the
> compiler is smart enough to reduce masking/or'ing to fewer instructions.
>

Sorry, my previous email was wrong.

Here is what I intended with the CAS operation:

diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
index 1d6d9f856ac5..0aeda7ced2c0 100644
--- a/arch/arm64/include/asm/futex.h
+++ b/arch/arm64/include/asm/futex.h
@@ -126,6 +126,60 @@ LSUI_FUTEX_ATOMIC_OP(or, ldtset, al)
 LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr, al)
 LSUI_FUTEX_ATOMIC_OP(set, swpt, al)

+
+#define LSUI_CMPXCHG_HELPER(suffix, start_bit)                                 \
+static __always_inline int                                                     \
+__lsui_cmpxchg_helper_##suffix(u64 __user *uaddr, u32 oldval, u32 newval)      \
+{                                                                              \
+       int ret = 0;                                                            \
+       u64 oval, nval, tmp;                                                    \
+                                                                               \
+       asm volatile("//__lsui_cmpxchg_helper_" #suffix "\n"                    \
+       __LSUI_PREAMBLE                                                         \
+"      prfm    pstl1strm, %2\n"                                                \
+"1:    ldtr    %x1, %2\n"                                                      \
+"      mov     %x3, %x1\n"                                                     \
+"      bfi     %x1, %x5, #" #start_bit ", #32\n"                               \
+"      bfi     %x3, %x6, #" #start_bit ", #32\n"                               \
+"      mov     %x4, %x1\n"                                                     \
+"2:    caslt   %x1, %x3, %2\n"                                                 \
+"      sub     %x1, %x1, %x4\n"                                                \
+"      cbz     %x1, 3f\n"                                                      \
+"      mov     %w0, %w7\n"                                                     \
+"3:\n"                                                                         \
+"      dmb     ish\n"                                                          \
+"4:\n"                                                                         \
+       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)                                   \
+       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)                                   \
+       : "+r" (ret), "=&r" (oval), "+Q" (*uaddr), "=&r" (nval), "=&r" (tmp)    \
+       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)                            \
+       : "memory");                                                            \
+                                                                               \
+       return ret;                                                             \
+}
+
+LSUI_CMPXCHG_HELPER(lo, 0)
+LSUI_CMPXCHG_HELPER(hi, 32)
+
+static __always_inline int
+__lsui_cmpxchg_helper(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
+{
+       int ret;
+       unsigned long uaddr_al;
+
+       uaddr_al = ALIGN_DOWN((unsigned long)uaddr, sizeof(u64));
+
+       if (uaddr_al != (unsigned long)uaddr)
+               ret = __lsui_cmpxchg_helper_hi((u64 __user *)uaddr_al, oldval, newval);
+       else
+               ret = __lsui_cmpxchg_helper_lo((u64 __user *)uaddr_al, oldval, newval);
+
+       if (!ret)
+               *oval = oldval;
+
+       return ret;
+}
+
 static __always_inline int
 __lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
 {
@@ -135,71 +189,25 @@ __lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
 static __always_inline int
 __lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
 {
-       unsigned int loops = FUTEX_MAX_LOOPS;
-       int ret, oldval, tmp;
+       int ret = -EAGAIN;
+       u32 oldval, newval;

        /*
         * there are no ldteor/stteor instructions...
         */
-       asm volatile("// __lsui_futex_atomic_eor\n"
-       __LSUI_PREAMBLE
-"      prfm    pstl1strm, %2\n"
-"1:    ldtxr   %w1, %2\n"
-"      eor     %w3, %w1, %w5\n"
-"2:    stltxr  %w0, %w3, %2\n"
-"      cbz     %w0, 3f\n"
-"      sub     %w4, %w4, %w0\n"
-"      cbnz    %w4, 1b\n"
-"      mov     %w0, %w6\n"
-"3:\n"
-"      dmb     ish\n"
-       _ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w0)
-       _ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w0)
-       : "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "=&r" (tmp),
-         "+r" (loops)
-       : "r" (oparg), "Ir" (-EAGAIN)
-       : "memory");
+       unsafe_get_user(oldval, uaddr, err_fault);
+       newval = oldval ^ oparg;

-       if (!ret)
-               *oval = oldval;
+       ret = __lsui_cmpxchg_helper(uaddr, oldval, newval, oval);

+err_fault:
        return ret;
 }

 static __always_inline int
 __lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
 {
-       int ret = 0;
-       unsigned int loops = FUTEX_MAX_LOOPS;
-       u32 val, tmp;
-
-       /*
-        * cas{al}t doesn't support word size...
-        */
-       asm volatile("//__lsui_futex_cmpxchg\n"
-       __LSUI_PREAMBLE
-"      prfm    pstl1strm, %2\n"
-"1:    ldtxr   %w1, %2\n"
-"      eor     %w3, %w1, %w5\n"
-"      cbnz    %w3, 4f\n"
-"2:    stltxr  %w3, %w6, %2\n"
-"      cbz     %w3, 3f\n"
-"      sub     %w4, %w4, %w3\n"
-"      cbnz    %w4, 1b\n"
-"      mov     %w0, %w7\n"
-"3:\n"
-"      dmb     ish\n"
-"4:\n"
-       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)
-       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)
-       : "+r" (ret), "=&r" (val), "+Q" (*uaddr), "=&r" (tmp), "+r" (loops)
-       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)
-       : "memory");
-
-       if (!ret)
-               *oval = oldval;
-
-       return ret;
+       return __lsui_cmpxchg_helper(uaddr, oldval, newval, oval);
 }

 #define __lsui_llsc_body(op, ...)                                      \

Am I missing something?

Thanks!

--
Sincerely,
Yeoreum Yun


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-16 10:02             ` Yeoreum Yun
@ 2025-09-16 10:16               ` Will Deacon
  2025-09-16 12:50                 ` Yeoreum Yun
  2025-09-16 12:47               ` Mark Rutland
  1 sibling, 1 reply; 51+ messages in thread
From: Will Deacon @ 2025-09-16 10:16 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: Catalin Marinas, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

On Tue, Sep 16, 2025 at 11:02:19AM +0100, Yeoreum Yun wrote:
> Hi,
> > On Mon, Sep 15, 2025 at 09:35:55PM +0100, Will Deacon wrote:
> > > On Mon, Sep 15, 2025 at 08:40:33PM +0100, Catalin Marinas wrote:
> > > > On Mon, Sep 15, 2025 at 11:32:39AM +0100, Yeoreum Yun wrote:
> > > > > So I think it would be better to keep the current LLSC implementation
> > > > > in LSUI.
> > > >
> > > > I think the code would look simpler with LL/SC but you can give it a try
> > > > and post the code sample here (not in a new series).
> > >
> > > If you stick the cas*t instruction in its own helper say, cmpxchg_user(),
> > > then you can do all the shifting/masking in C and I don't reckon it's
> > > that bad. It means we (a) get rid of exclusives, which is the whole
> > > point of this and (b) don't have to mess around with PAN.
> >
> > We get rid of PAN toggling already since FEAT_LSUI introduces
> > LDTXR/STTXR. But, I'm all for CAS if it doesn't look too bad. Easier
> > I think if we do a get_user() of a u64 and combine it with the futex u32
> > while taking care of CPU endianness. All in a loop. Hopefully the
> > compiler is smart enough to reduce masking/or'ing to fewer instructions.
> >
> 
> Sorry for my wrong previous email.
> 
> Here is what i intened with CAS operation:
> 
> diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> index 1d6d9f856ac5..0aeda7ced2c0 100644
> --- a/arch/arm64/include/asm/futex.h
> +++ b/arch/arm64/include/asm/futex.h
> @@ -126,6 +126,60 @@ LSUI_FUTEX_ATOMIC_OP(or, ldtset, al)
>  LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr, al)
>  LSUI_FUTEX_ATOMIC_OP(set, swpt, al)
> 
> +
> +#define LSUI_CMPXCHG_HELPER(suffix, start_bit)                                 \
> +static __always_inline int                                                     \
> +__lsui_cmpxchg_helper_##suffix(u64 __user *uaddr, u32 oldval, u32 newval)      \
> +{                                                                              \
> +       int ret = 0;                                                            \
> +       u64 oval, nval, tmp;                                                    \
> +                                                                               \
> +       asm volatile("//__lsui_cmpxchg_helper_" #suffix "\n"                    \
> +       __LSUI_PREAMBLE                                                         \
> +"      prfm    pstl1strm, %2\n"                                                \
> +"1:    ldtr    %x1, %2\n"                                                      \
> +"      mov     %x3, %x1\n"                                                     \
> +"      bfi     %x1, %x5, #" #start_bit ", #32\n"                               \
> +"      bfi     %x3, %x6, #" #start_bit ", #32\n"                               \
> +"      mov     %x4, %x1\n"                                                     \
> +"2:    caslt   %x1, %x3, %2\n"                                                 \
> +"      sub     %x1, %x1, %x4\n"                                                \
> +"      cbz     %x1, 3f\n"                                                      \
> +"      mov     %w0, %w7\n"                                                     \
> +"3:\n"                                                                         \
> +"      dmb     ish\n"                                                          \
> +"4:\n"                                                                         \
> +       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)                                   \
> +       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)                                   \
> +       : "+r" (ret), "=&r" (oval), "+Q" (*uaddr), "=&r" (nval), "=&r" (tmp)    \
> +       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)                            \
> +       : "memory");                                                            \

The vast majority of this can be written in C.

Will


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-16 10:02             ` Yeoreum Yun
  2025-09-16 10:16               ` Will Deacon
@ 2025-09-16 12:47               ` Mark Rutland
  2025-09-16 13:27                 ` Yeoreum Yun
  1 sibling, 1 reply; 51+ messages in thread
From: Mark Rutland @ 2025-09-16 12:47 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: Catalin Marinas, Will Deacon, broonie, maz, oliver.upton,
	joey.gouly, james.morse, ardb, scott, suzuki.poulose, yuzenghui,
	linux-arm-kernel, kvmarm, linux-kernel

On Tue, Sep 16, 2025 at 11:02:19AM +0100, Yeoreum Yun wrote:
> Hi,
> > On Mon, Sep 15, 2025 at 09:35:55PM +0100, Will Deacon wrote:
> > > On Mon, Sep 15, 2025 at 08:40:33PM +0100, Catalin Marinas wrote:
> > > > On Mon, Sep 15, 2025 at 11:32:39AM +0100, Yeoreum Yun wrote:
> > > > > So I think it would be better to keep the current LLSC implementation
> > > > > in LSUI.
> > > >
> > > > I think the code would look simpler with LL/SC but you can give it a try
> > > > and post the code sample here (not in a new series).
> > >
> > > If you stick the cas*t instruction in its own helper say, cmpxchg_user(),
> > > then you can do all the shifting/masking in C and I don't reckon it's
> > > that bad. It means we (a) get rid of exclusives, which is the whole
> > > point of this and (b) don't have to mess around with PAN.
> >
> > We get rid of PAN toggling already since FEAT_LSUI introduces
> > LDTXR/STTXR. But, I'm all for CAS if it doesn't look too bad. Easier
> > I think if we do a get_user() of a u64 and combine it with the futex u32
> > while taking care of CPU endianness. All in a loop. Hopefully the
> > compiler is smart enough to reduce masking/or'ing to fewer instructions.
> >
> 
> Sorry for my wrong previous email.
> 
> Here is what i intened with CAS operation:
> 
> diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> index 1d6d9f856ac5..0aeda7ced2c0 100644
> --- a/arch/arm64/include/asm/futex.h
> +++ b/arch/arm64/include/asm/futex.h
> @@ -126,6 +126,60 @@ LSUI_FUTEX_ATOMIC_OP(or, ldtset, al)
>  LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr, al)
>  LSUI_FUTEX_ATOMIC_OP(set, swpt, al)
> 
> +
> +#define LSUI_CMPXCHG_HELPER(suffix, start_bit)                                 \
> +static __always_inline int                                                     \
> +__lsui_cmpxchg_helper_##suffix(u64 __user *uaddr, u32 oldval, u32 newval)      \
> +{                                                                              \
> +       int ret = 0;                                                            \
> +       u64 oval, nval, tmp;                                                    \
> +                                                                               \
> +       asm volatile("//__lsui_cmpxchg_helper_" #suffix "\n"                    \
> +       __LSUI_PREAMBLE                                                         \
> +"      prfm    pstl1strm, %2\n"                                                \
> +"1:    ldtr    %x1, %2\n"                                                      \
> +"      mov     %x3, %x1\n"                                                     \
> +"      bfi     %x1, %x5, #" #start_bit ", #32\n"                               \
> +"      bfi     %x3, %x6, #" #start_bit ", #32\n"                               \
> +"      mov     %x4, %x1\n"                                                     \
> +"2:    caslt   %x1, %x3, %2\n"                                                 \
> +"      sub     %x1, %x1, %x4\n"                                                \
> +"      cbz     %x1, 3f\n"                                                      \
> +"      mov     %w0, %w7\n"                                                     \
> +"3:\n"                                                                         \
> +"      dmb     ish\n"                                                          \
> +"4:\n"                                                                         \
> +       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)                                   \
> +       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)                                   \
> +       : "+r" (ret), "=&r" (oval), "+Q" (*uaddr), "=&r" (nval), "=&r" (tmp)    \
> +       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)                            \
> +       : "memory");                                                            \
> +                                                                               \
> +       return ret;                                                             \
> +}
> +
> +LSUI_CMPXCHG_HELPER(lo, 0)
> +LSUI_CMPXCHG_HELPER(hi, 32)
> +
> +static __always_inline int
> +__lsui_cmpxchg_helper(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> +{
> +       int ret;
> +       unsigned long uaddr_al;
> +
> +       uaddr_al = ALIGN_DOWN((unsigned long)uaddr, sizeof(u64));
> +
> +       if (uaddr_al != (unsigned long)uaddr)
> +               ret = __lsui_cmpxchg_helper_hi((u64 __user *)uaddr_al, oldval, newval);
> +       else
> +               ret = __lsui_cmpxchg_helper_lo((u64 __user *)uaddr_al, oldval, newval);
> +
> +       if (!ret)
> +               *oval = oldval;
> +
> +       return ret;
> +}

I think Will expects you to do more of this in C, e.g. have a basic
user cmpxchg on a 64-bit type:

/*
 * NOTE: *oldp is NOT updated if a fault is taken.
 */
static __always_inline int
user_cmpxchg64_release(u64 __user *addr, u64 *oldp, u64 new)
{
	int err = 0;

	asm volatile(
	__LSUI_PREAMBLE
	"1:	caslt	%x[old], %x[new], %[addr]\n"
	"2:\n"
	_ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w[err])
	: [addr] "+Q" (*addr),
	  [old] "+r" (*oldp),
	  [err] "+r" (err)
	: [new] "r" (new)
	: "memory"
	);

	return err;
}

That should be the *only* assembly you need to implement.

Atop that, have a wrapper that uses get_user() and that helper above to
implement the 32-bit user cmpxchg, with all the bit manipulation in C:

/*
 * NOTE: *oldp is NOT updated if a fault is taken.
 */
static int user_cmpxchg32_release(u32 __user *addr, u32 *oldp, u32 new)
{
	u64 __user *addr64 = PTR_ALIGN_DOWN(addr, sizeof(u64));
	u64 old64, new64;
	int ret;

	if (get_user(old64, addr64))
		return -EFAULT;

	/*
	 * TODO: merge '*oldp' into 'old64', and 'new' into 'new64',
	 * taking into account endianness and alignment
	 */

	ret = user_cmpxchg64_release(addr64, &old64, new64);
	if (ret)
		return ret;

	/*
	 * TODO: extract '*oldp' from 'old64', taking into account
	 * endianness and alignment.
	 */

	return 0;
}

Then use that 32-bit user cmpxchg for any cases which need it.

As the compiler will have visibility of all of the bit manipulation, it
will hopefully be able to fold that together appropriately.
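
To spell out those TODOs, the merge/extract could look roughly like the
sketch below. This is illustrative only, not the posted patch: the shift
computation, the U32_MAX mask and the pointer cast are my assumptions
about how the alignment and CONFIG_CPU_BIG_ENDIAN handling would fall out.

/*
 * Sketch only: fills in the TODOs above, assuming the futex word sits in
 * either the low or high half of the containing u64 depending on its
 * alignment and on CPU endianness.
 */
static int user_cmpxchg32_release(u32 __user *addr, u32 *oldp, u32 new)
{
	u64 __user *addr64 = (u64 __user *)PTR_ALIGN_DOWN(addr, sizeof(u64));
	unsigned int shift = ((unsigned long)addr & 4) ? 32 : 0;
	u64 mask, old64, new64;
	int ret;

	if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
		shift ^= 32;
	mask = (u64)U32_MAX << shift;

	if (get_user(old64, addr64))
		return -EFAULT;

	/* merge '*oldp' and 'new' into the containing 64-bit word */
	new64 = (old64 & ~mask) | ((u64)new << shift);
	old64 = (old64 & ~mask) | ((u64)*oldp << shift);

	ret = user_cmpxchg64_release(addr64, &old64, new64);
	if (ret)
		return ret;

	/* extract the observed 32-bit value for the caller to compare */
	*oldp = old64 >> shift;
	return 0;
}

Returning 0 with *oldp holding the observed value matches the futex
cmpxchg convention, where the caller compares the old value against what
it expected.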

Mark.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-16 10:16               ` Will Deacon
@ 2025-09-16 12:50                 ` Yeoreum Yun
  2025-09-17  9:32                   ` Yeoreum Yun
  0 siblings, 1 reply; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-16 12:50 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

Hi,

[...]
> > +#define LSUI_CMPXCHG_HELPER(suffix, start_bit)                                 \
> > +static __always_inline int                                                     \
> > +__lsui_cmpxchg_helper_##suffix(u64 __user *uaddr, u32 oldval, u32 newval)      \
> > +{                                                                              \
> > +       int ret = 0;                                                            \
> > +       u64 oval, nval, tmp;                                                    \
> > +                                                                               \
> > +       asm volatile("//__lsui_cmpxchg_helper_" #suffix "\n"                    \
> > +       __LSUI_PREAMBLE                                                         \
> > +"      prfm    pstl1strm, %2\n"                                                \
> > +"1:    ldtr    %x1, %2\n"                                                      \
> > +"      mov     %x3, %x1\n"                                                     \
> > +"      bfi     %x1, %x5, #" #start_bit ", #32\n"                               \
> > +"      bfi     %x3, %x6, #" #start_bit ", #32\n"                               \
> > +"      mov     %x4, %x1\n"                                                     \
> > +"2:    caslt   %x1, %x3, %2\n"                                                 \
> > +"      sub     %x1, %x1, %x4\n"                                                \
> > +"      cbz     %x1, 3f\n"                                                      \
> > +"      mov     %w0, %w7\n"                                                     \
> > +"3:\n"                                                                         \
> > +"      dmb     ish\n"                                                          \
> > +"4:\n"                                                                         \
> > +       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)                                   \
> > +       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)                                   \
> > +       : "+r" (ret), "=&r" (oval), "+Q" (*uaddr), "=&r" (nval), "=&r" (tmp)    \
> > +       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)                            \
> > +       : "memory");                                                            \
>
> The vast majority of this can be written in C.

Here is the C-based version, on top of patch 6:

diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
index 1d6d9f856ac5..68af15ba545a 100644
--- a/arch/arm64/include/asm/futex.h
+++ b/arch/arm64/include/asm/futex.h
@@ -127,81 +127,77 @@ LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr, al)
 LSUI_FUTEX_ATOMIC_OP(set, swpt, al)

 static __always_inline int
-__lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
+__lsui_cmpxchg_helper(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
 {
-       return __lsui_futex_atomic_andnot(~oparg, uaddr, oval);
-}
+       int ret = -EAGAIN;
+       u64 __user *uaddr_al;
+       u64 oval64, nval64, tmp;
+       static const u64 lo32_mask = GENMASK_U64(31, 0);
+
+       uaddr_al = (u64 __user *) ALIGN_DOWN((unsigned long)uaddr, sizeof(u64));
+       unsafe_get_user(oval64, uaddr_al, err_fault);
+
+       if ((u32 __user *)uaddr_al != uaddr) {
+               nval64 = ((oval64 & lo32_mask) | ((u64)newval << 32));
+               oval64 = ((oval64 & lo32_mask) | ((u64)oldval << 32));
+       } else {
+               nval64 = ((oval64 & ~lo32_mask) | newval);
+               oval64 = ((oval64 & ~lo32_mask) | oldval);
+       }

-static __always_inline int
-__lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
-{
-       unsigned int loops = FUTEX_MAX_LOOPS;
-       int ret, oldval, tmp;
+       tmp = oval64;

-       /*
-        * there are no ldteor/stteor instructions...
-        */
-       asm volatile("// __lsui_futex_atomic_eor\n"
+       asm volatile("//__lsui_cmpxchg_helper\n"
        __LSUI_PREAMBLE
-"      prfm    pstl1strm, %2\n"
-"1:    ldtxr   %w1, %2\n"
-"      eor     %w3, %w1, %w5\n"
-"2:    stltxr  %w0, %w3, %2\n"
-"      cbz     %w0, 3f\n"
-"      sub     %w4, %w4, %w0\n"
-"      cbnz    %w4, 1b\n"
-"      mov     %w0, %w6\n"
-"3:\n"
+"1:    caslt   %x1, %x3, %2\n"
+"      sub     %x1, %x1, %x4\n"
+"      cbnz    %x1, 2f\n"
+"      mov     %w0, %w5\n"
+"2:\n"
 "      dmb     ish\n"
+"3:\n"
        _ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w0)
-       _ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w0)
-       : "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "=&r" (tmp),
-         "+r" (loops)
-       : "r" (oparg), "Ir" (-EAGAIN)
+       : "+r" (ret), "=&r" (oval64), "+Q" (*uaddr_al)
+       : "r" (nval64), "r" (tmp), "Ir" (0)
        : "memory");

        if (!ret)
                *oval = oldval;

+err_fault:
        return ret;
 }

 static __always_inline int
-__lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
+__lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
 {
-       int ret = 0;
-       unsigned int loops = FUTEX_MAX_LOOPS;
-       u32 val, tmp;
+       return __lsui_futex_atomic_andnot(~oparg, uaddr, oval);
+}
+
+static __always_inline int
+__lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
+{
+       int ret = -EAGAIN;
+       u32 oldval, newval;

        /*
-        * cas{al}t doesn't support word size...
+        * there are no ldteor/stteor instructions...
         */
-       asm volatile("//__lsui_futex_cmpxchg\n"
-       __LSUI_PREAMBLE
-"      prfm    pstl1strm, %2\n"
-"1:    ldtxr   %w1, %2\n"
-"      eor     %w3, %w1, %w5\n"
-"      cbnz    %w3, 4f\n"
-"2:    stltxr  %w3, %w6, %2\n"
-"      cbz     %w3, 3f\n"
-"      sub     %w4, %w4, %w3\n"
-"      cbnz    %w4, 1b\n"
-"      mov     %w0, %w7\n"
-"3:\n"
-"      dmb     ish\n"
-"4:\n"
-       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)
-       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)
-       : "+r" (ret), "=&r" (val), "+Q" (*uaddr), "=&r" (tmp), "+r" (loops)
-       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)
-       : "memory");
+       unsafe_get_user(oldval, uaddr, err_fault);
+       newval = oldval ^ oparg;

-       if (!ret)
-               *oval = oldval;
+       ret = __lsui_cmpxchg_helper(uaddr, oldval, newval, oval);

+err_fault:
        return ret;
 }

+static __always_inline int
+__lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
+{
+       return __lsui_cmpxchg_helper(uaddr, oldval, newval, oval);
+}
+
 #define __lsui_llsc_body(op, ...)                                      \
 ({                                                                     \
        alternative_has_cap_likely(ARM64_HAS_LSUI) ?                    \

I'm not sure whether this looks good to you,
but if you could share your thoughts, that would be great.
(Note:
  when I tested with 256 threads on the futex_atomic_eor op, there was not
  much difference from the former assembly version.)

Thanks!

--
Sincerely,
Yeoreum Yun


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-15 22:34         ` Yeoreum Yun
@ 2025-09-16 12:53           ` Catalin Marinas
  0 siblings, 0 replies; 51+ messages in thread
From: Catalin Marinas @ 2025-09-16 12:53 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: will, broonie, maz, oliver.upton, joey.gouly, james.morse, ardb,
	scott, suzuki.poulose, yuzenghui, mark.rutland, linux-arm-kernel,
	kvmarm, linux-kernel

On Mon, Sep 15, 2025 at 11:34:07PM +0100, Yeoreum Yun wrote:
> with selftest's futex testcase, I've tested.
> But Since there is no test for each operation even in LTP,
> I tested it with additional test from me.

As a separate series, you may improve the futex kselftest to cover all
the cases.

-- 
Catalin


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-16 12:47               ` Mark Rutland
@ 2025-09-16 13:27                 ` Yeoreum Yun
  2025-09-16 13:45                   ` Mark Rutland
  0 siblings, 1 reply; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-16 13:27 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, broonie, maz, oliver.upton,
	joey.gouly, james.morse, ardb, scott, suzuki.poulose, yuzenghui,
	linux-arm-kernel, kvmarm, linux-kernel

Hi Mark,

[...]
> > diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> > index 1d6d9f856ac5..0aeda7ced2c0 100644
> > --- a/arch/arm64/include/asm/futex.h
> > +++ b/arch/arm64/include/asm/futex.h
> > @@ -126,6 +126,60 @@ LSUI_FUTEX_ATOMIC_OP(or, ldtset, al)
> >  LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr, al)
> >  LSUI_FUTEX_ATOMIC_OP(set, swpt, al)
> >
> > +
> > +#define LSUI_CMPXCHG_HELPER(suffix, start_bit)                                 \
> > +static __always_inline int                                                     \
> > +__lsui_cmpxchg_helper_##suffix(u64 __user *uaddr, u32 oldval, u32 newval)      \
> > +{                                                                              \
> > +       int ret = 0;                                                            \
> > +       u64 oval, nval, tmp;                                                    \
> > +                                                                               \
> > +       asm volatile("//__lsui_cmpxchg_helper_" #suffix "\n"                    \
> > +       __LSUI_PREAMBLE                                                         \
> > +"      prfm    pstl1strm, %2\n"                                                \
> > +"1:    ldtr    %x1, %2\n"                                                      \
> > +"      mov     %x3, %x1\n"                                                     \
> > +"      bfi     %x1, %x5, #" #start_bit ", #32\n"                               \
> > +"      bfi     %x3, %x6, #" #start_bit ", #32\n"                               \
> > +"      mov     %x4, %x1\n"                                                     \
> > +"2:    caslt   %x1, %x3, %2\n"                                                 \
> > +"      sub     %x1, %x1, %x4\n"                                                \
> > +"      cbz     %x1, 3f\n"                                                      \
> > +"      mov     %w0, %w7\n"                                                     \
> > +"3:\n"                                                                         \
> > +"      dmb     ish\n"                                                          \
> > +"4:\n"                                                                         \
> > +       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)                                   \
> > +       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)                                   \
> > +       : "+r" (ret), "=&r" (oval), "+Q" (*uaddr), "=&r" (nval), "=&r" (tmp)    \
> > +       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)                            \
> > +       : "memory");                                                            \
> > +                                                                               \
> > +       return ret;                                                             \
> > +}
> > +
> > +LSUI_CMPXCHG_HELPER(lo, 0)
> > +LSUI_CMPXCHG_HELPER(hi, 32)
> > +
> > +static __always_inline int
> > +__lsui_cmpxchg_helper(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> > +{
> > +       int ret;
> > +       unsigned long uaddr_al;
> > +
> > +       uaddr_al = ALIGN_DOWN((unsigned long)uaddr, sizeof(u64));
> > +
> > +       if (uaddr_al != (unsigned long)uaddr)
> > +               ret = __lsui_cmpxchg_helper_hi((u64 __user *)uaddr_al, oldval, newval);
> > +       else
> > +               ret = __lsui_cmpxchg_helper_lo((u64 __user *)uaddr_al, oldval, newval);
> > +
> > +       if (!ret)
> > +               *oval = oldval;
> > +
> > +       return ret;
> > +}
>
> I think Will expects that you do more of this in C, e.g. have a basic
> user cmpxchg on a 64-bit type, e.g.
>
> /*
>  * NOTE: *oldp is NOT updated if a fault is taken.
>  */
> static __always_inline int
> user_cmpxchg64_release(u64 __usr *addr, u64 *oldp, u64 new)
> {
> 	int err = 0;
>
> 	asm volatile(
> 	__LSUI_PREAMBLE
> 	"1:	caslt	%x[old], %x[new], %[addr]\n"
> 	"2:\n"
> 	_ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)
> 	: [addr] "+Q" (addr),
> 	  [old] "+r" (*oldp)
> 	: [new] "r" (new)
> 	: "memory"
> 	);
>
> 	return err;
> }
>
> That should be the *only* assembly you need to implement.
>
> Atop that, have a wrapper that uses get_user() and that helper above to
> implement the 32-bit user cmpxchg, with all the bit manipulation in C:

Thanks for your suggestion, but a small question:
I think it's enough to use unsafe_get_user() instead of get_user() here,
since when FEAT_LSUI is enabled it doesn't need to call
uaccess_ttbr0_enable()/disable().

Or am I missing something?

>
> /*
>  * NOTE: *oldp is NOT updated if a fault is taken.
>  */
> static int user_cmpxchg32_release(u32 __user *addr, u32 *oldp, u32 new)
> {
> 	u64 __user *addr64 = PTR_ALIGN_DOWN(addr, sizeof(u64));
> 	u64 old64, new64;
> 	int ret;
>
> 	if (get_user(old64, uaddr64))
> 		return -EFAULT;
>
> 	/*
> 	 * TODO: merge '*oldp' into 'old64', and 'new' into 'new64',
> 	 * taking into account endianness and alignment
> 	 */
>
> 	ret = user_cmpxchg64_release(uaddr64, &old64, new64);
> 	if (ret)
> 		return ret;
>
> 	/*
> 	 * TODO: extract '*oldp' from 'old64', taking into account
> 	 * endianness and alignment.
> 	 */

I think the extraction isn't needed, since the futex cmpxchg op passes the
"oldval" argument, which has already been fetched and passed in.
So when it succeeds, it seems enough to return the passed argument.
But I'm not sure which is better: splitting the cmpxchg64 and
cmpxchg32 interfaces, or making one helper -- user_cmpxchg_release
(__lsui_cmpxchg_helper).

[...]

Thanks.

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-16 13:27                 ` Yeoreum Yun
@ 2025-09-16 13:45                   ` Mark Rutland
  2025-09-16 13:58                     ` Yeoreum Yun
  0 siblings, 1 reply; 51+ messages in thread
From: Mark Rutland @ 2025-09-16 13:45 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: Catalin Marinas, Will Deacon, broonie, maz, oliver.upton,
	joey.gouly, james.morse, ardb, scott, suzuki.poulose, yuzenghui,
	linux-arm-kernel, kvmarm, linux-kernel

On Tue, Sep 16, 2025 at 02:27:37PM +0100, Yeoreum Yun wrote:
> Hi Mark,
> 
> [...]
> > > diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> > > index 1d6d9f856ac5..0aeda7ced2c0 100644
> > > --- a/arch/arm64/include/asm/futex.h
> > > +++ b/arch/arm64/include/asm/futex.h
> > > @@ -126,6 +126,60 @@ LSUI_FUTEX_ATOMIC_OP(or, ldtset, al)
> > >  LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr, al)
> > >  LSUI_FUTEX_ATOMIC_OP(set, swpt, al)
> > >
> > > +
> > > +#define LSUI_CMPXCHG_HELPER(suffix, start_bit)                                 \
> > > +static __always_inline int                                                     \
> > > +__lsui_cmpxchg_helper_##suffix(u64 __user *uaddr, u32 oldval, u32 newval)      \
> > > +{                                                                              \
> > > +       int ret = 0;                                                            \
> > > +       u64 oval, nval, tmp;                                                    \
> > > +                                                                               \
> > > +       asm volatile("//__lsui_cmpxchg_helper_" #suffix "\n"                    \
> > > +       __LSUI_PREAMBLE                                                         \
> > > +"      prfm    pstl1strm, %2\n"                                                \
> > > +"1:    ldtr    %x1, %2\n"                                                      \
> > > +"      mov     %x3, %x1\n"                                                     \
> > > +"      bfi     %x1, %x5, #" #start_bit ", #32\n"                               \
> > > +"      bfi     %x3, %x6, #" #start_bit ", #32\n"                               \
> > > +"      mov     %x4, %x1\n"                                                     \
> > > +"2:    caslt   %x1, %x3, %2\n"                                                 \
> > > +"      sub     %x1, %x1, %x4\n"                                                \
> > > +"      cbz     %x1, 3f\n"                                                      \
> > > +"      mov     %w0, %w7\n"                                                     \
> > > +"3:\n"                                                                         \
> > > +"      dmb     ish\n"                                                          \
> > > +"4:\n"                                                                         \
> > > +       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)                                   \
> > > +       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)                                   \
> > > +       : "+r" (ret), "=&r" (oval), "+Q" (*uaddr), "=&r" (nval), "=&r" (tmp)    \
> > > +       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)                            \
> > > +       : "memory");                                                            \
> > > +                                                                               \
> > > +       return ret;                                                             \
> > > +}
> > > +
> > > +LSUI_CMPXCHG_HELPER(lo, 0)
> > > +LSUI_CMPXCHG_HELPER(hi, 32)
> > > +
> > > +static __always_inline int
> > > +__lsui_cmpxchg_helper(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> > > +{
> > > +       int ret;
> > > +       unsigned long uaddr_al;
> > > +
> > > +       uaddr_al = ALIGN_DOWN((unsigned long)uaddr, sizeof(u64));
> > > +
> > > +       if (uaddr_al != (unsigned long)uaddr)
> > > +               ret = __lsui_cmpxchg_helper_hi((u64 __user *)uaddr_al, oldval, newval);
> > > +       else
> > > +               ret = __lsui_cmpxchg_helper_lo((u64 __user *)uaddr_al, oldval, newval);
> > > +
> > > +       if (!ret)
> > > +               *oval = oldval;
> > > +
> > > +       return ret;
> > > +}
> >
> > I think Will expects that you do more of this in C, e.g. have a basic
> > user cmpxchg on a 64-bit type, e.g.
> >
> > /*
> >  * NOTE: *oldp is NOT updated if a fault is taken.
> >  */
> > static __always_inline int
> > user_cmpxchg64_release(u64 __usr *addr, u64 *oldp, u64 new)
> > {
> > 	int err = 0;
> >
> > 	asm volatile(
> > 	__LSUI_PREAMBLE
> > 	"1:	caslt	%x[old], %x[new], %[addr]\n"
> > 	"2:\n"
> > 	_ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)
> > 	: [addr] "+Q" (addr),
> > 	  [old] "+r" (*oldp)
> > 	: [new] "r" (new)
> > 	: "memory"
> > 	);
> >
> > 	return err;
> > }
> >
> > That should be the *only* assembly you need to implement.
> >
> > Atop that, have a wrapper that uses get_user() and that helper above to
> > implement the 32-bit user cmpxchg, with all the bit manipulation in C:
> 
> Thanks for your suggestion. But small question.
> I think it's enough to use usafe_get_user() instead of get_user() in here
> since when FEAT_LSUI enabled, it doeesn't need to call
> uaccess_ttbr0_enable()/disable().

Regardless of uaccess_ttbr0_enable() and uaccess_ttbr0_disable()
specifically, API-wise unsafe_get_user() is only supposed to be called
between user_access_begin() and user_access_end(), and there's some
stuff we probably want to add there (e.g. might_fault(), which
unsafe_get_user() lacks today).

Do we call those?

Mark.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-16 13:45                   ` Mark Rutland
@ 2025-09-16 13:58                     ` Yeoreum Yun
  2025-09-16 14:07                       ` Mark Rutland
  0 siblings, 1 reply; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-16 13:58 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, broonie, maz, oliver.upton,
	joey.gouly, james.morse, ardb, scott, suzuki.poulose, yuzenghui,
	linux-arm-kernel, kvmarm, linux-kernel

Hi Mark,

[...]
> > > > diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> > > > index 1d6d9f856ac5..0aeda7ced2c0 100644
> > > > --- a/arch/arm64/include/asm/futex.h
> > > > +++ b/arch/arm64/include/asm/futex.h
> > > > @@ -126,6 +126,60 @@ LSUI_FUTEX_ATOMIC_OP(or, ldtset, al)
> > > >  LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr, al)
> > > >  LSUI_FUTEX_ATOMIC_OP(set, swpt, al)
> > > >
> > > > +
> > > > +#define LSUI_CMPXCHG_HELPER(suffix, start_bit)                                 \
> > > > +static __always_inline int                                                     \
> > > > +__lsui_cmpxchg_helper_##suffix(u64 __user *uaddr, u32 oldval, u32 newval)      \
> > > > +{                                                                              \
> > > > +       int ret = 0;                                                            \
> > > > +       u64 oval, nval, tmp;                                                    \
> > > > +                                                                               \
> > > > +       asm volatile("//__lsui_cmpxchg_helper_" #suffix "\n"                    \
> > > > +       __LSUI_PREAMBLE                                                         \
> > > > +"      prfm    pstl1strm, %2\n"                                                \
> > > > +"1:    ldtr    %x1, %2\n"                                                      \
> > > > +"      mov     %x3, %x1\n"                                                     \
> > > > +"      bfi     %x1, %x5, #" #start_bit ", #32\n"                               \
> > > > +"      bfi     %x3, %x6, #" #start_bit ", #32\n"                               \
> > > > +"      mov     %x4, %x1\n"                                                     \
> > > > +"2:    caslt   %x1, %x3, %2\n"                                                 \
> > > > +"      sub     %x1, %x1, %x4\n"                                                \
> > > > +"      cbz     %x1, 3f\n"                                                      \
> > > > +"      mov     %w0, %w7\n"                                                     \
> > > > +"3:\n"                                                                         \
> > > > +"      dmb     ish\n"                                                          \
> > > > +"4:\n"                                                                         \
> > > > +       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)                                   \
> > > > +       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)                                   \
> > > > +       : "+r" (ret), "=&r" (oval), "+Q" (*uaddr), "=&r" (nval), "=&r" (tmp)    \
> > > > +       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)                            \
> > > > +       : "memory");                                                            \
> > > > +                                                                               \
> > > > +       return ret;                                                             \
> > > > +}
> > > > +
> > > > +LSUI_CMPXCHG_HELPER(lo, 0)
> > > > +LSUI_CMPXCHG_HELPER(hi, 32)
> > > > +
> > > > +static __always_inline int
> > > > +__lsui_cmpxchg_helper(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> > > > +{
> > > > +       int ret;
> > > > +       unsigned long uaddr_al;
> > > > +
> > > > +       uaddr_al = ALIGN_DOWN((unsigned long)uaddr, sizeof(u64));
> > > > +
> > > > +       if (uaddr_al != (unsigned long)uaddr)
> > > > +               ret = __lsui_cmpxchg_helper_hi((u64 __user *)uaddr_al, oldval, newval);
> > > > +       else
> > > > +               ret = __lsui_cmpxchg_helper_lo((u64 __user *)uaddr_al, oldval, newval);
> > > > +
> > > > +       if (!ret)
> > > > +               *oval = oldval;
> > > > +
> > > > +       return ret;
> > > > +}
> > >
> > > I think Will expects that you do more of this in C, e.g. have a basic
> > > user cmpxchg on a 64-bit type, e.g.
> > >
> > > /*
> > >  * NOTE: *oldp is NOT updated if a fault is taken.
> > >  */
> > > static __always_inline int
> > > user_cmpxchg64_release(u64 __usr *addr, u64 *oldp, u64 new)
> > > {
> > > 	int err = 0;
> > >
> > > 	asm volatile(
> > > 	__LSUI_PREAMBLE
> > > 	"1:	caslt	%x[old], %x[new], %[addr]\n"
> > > 	"2:\n"
> > > 	_ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)
> > > 	: [addr] "+Q" (addr),
> > > 	  [old] "+r" (*oldp)
> > > 	: [new] "r" (new)
> > > 	: "memory"
> > > 	);
> > >
> > > 	return err;
> > > }
> > >
> > > That should be the *only* assembly you need to implement.
> > >
> > > Atop that, have a wrapper that uses get_user() and that helper above to
> > > implement the 32-bit user cmpxchg, with all the bit manipulation in C:
> >
> > Thanks for your suggestion. But small question.
> > I think it's enough to use usafe_get_user() instead of get_user() in here
> > since when FEAT_LSUI enabled, it doeesn't need to call
> > uaccess_ttbr0_enable()/disable().
>
> Regardless of uaccess_ttbr0_enable() and uaccess_ttbr0_disable()
> specifically, API-wise unsafe_get_user() is only supposed to be called
> between user_access_begin() and user_access_end(), and there's some
> stuff we probably want to add there (e.g. might_fault(), which
> unsafe_get_user() lacks today).
>
> Do we call those?

Yes, where they're available.
As you mention, the difference seems to be might_fault(),
but I'm not sure whether that would be enough reason to use
get_user() instead of unsafe_get_user(), given the extra "nop"
instructions from uaccess_ttbr0_enable()/disable() under LSUI,
except for debugging purposes.

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-16 13:58                     ` Yeoreum Yun
@ 2025-09-16 14:07                       ` Mark Rutland
  2025-09-16 14:15                         ` Yeoreum Yun
  0 siblings, 1 reply; 51+ messages in thread
From: Mark Rutland @ 2025-09-16 14:07 UTC (permalink / raw)
  To: Yeoreum Yun
  Cc: Catalin Marinas, Will Deacon, broonie, maz, oliver.upton,
	joey.gouly, james.morse, ardb, scott, suzuki.poulose, yuzenghui,
	linux-arm-kernel, kvmarm, linux-kernel

On Tue, Sep 16, 2025 at 02:58:16PM +0100, Yeoreum Yun wrote:
> Hi Mark,
> 
> [...]
> > > I think it's enough to use usafe_get_user() instead of get_user() in here
> > > since when FEAT_LSUI enabled, it doeesn't need to call
> > > uaccess_ttbr0_enable()/disable().
> >
> > Regardless of uaccess_ttbr0_enable() and uaccess_ttbr0_disable()
> > specifically, API-wise unsafe_get_user() is only supposed to be called
> > between user_access_begin() and user_access_end(), and there's some
> > stuff we probably want to add there (e.g. might_fault(), which
> > unsafe_get_user() lacks today).
> >
> > Do we call those?
> 
> Yes when you're available.
> As you mention, the difference seems might_fault(),
> But I'm not sure whether that would be a reason to validate to use
> get_user() instead of unsafe_get_user() taking a increase of instruction
> of "nop" -- uaccess_ttbr0_enable()/disable() in LSUI
> except the reason for DEUBG purpose.

I think the practical impact of those NOPs is going to be negligible, and
not worth optimizing for unless/until we have data demonstrating
otherwise.

If we want to strictly avoid those NOPs, I think that we should do a
more general cleanup, and e.g. have variants of user_access_begin() and
user_access_end() that do not mess with TTBR0. I don't think we need to
do that for this series.

For now, I think that you should either:

* Use get_user().

* Use user_access_begin() .. user_access_end() wrapping both
  unsafe_get_user() and the user cmpxchg.
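
For the second option, a rough sketch of the shape (illustrative only,
not the posted patch; it assumes the __lsui_cmpxchg_helper() from the
diff earlier in the thread):

static __always_inline int
__lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
{
	int ret = -EFAULT;
	u32 oldval;

	if (!user_access_begin(uaddr, sizeof(u32)))
		return -EFAULT;

	unsafe_get_user(oldval, uaddr, out);
	/* the helper takes a u32 *, hence the cast */
	ret = __lsui_cmpxchg_helper(uaddr, oldval, oldval ^ oparg, (u32 *)oval);
out:
	user_access_end();
	return ret;
}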

Thanks,
Mark.


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-16 14:07                       ` Mark Rutland
@ 2025-09-16 14:15                         ` Yeoreum Yun
  0 siblings, 0 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-16 14:15 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Catalin Marinas, Will Deacon, broonie, maz, oliver.upton,
	joey.gouly, james.morse, ardb, scott, suzuki.poulose, yuzenghui,
	linux-arm-kernel, kvmarm, linux-kernel

On Tue, Sep 16, 2025 at 03:07:09PM +0100, Mark Rutland wrote:
> On Tue, Sep 16, 2025 at 02:58:16PM +0100, Yeoreum Yun wrote:
> > Hi Mark,
> >
> > [...]
> > > > I think it's enough to use usafe_get_user() instead of get_user() in here
> > > > since when FEAT_LSUI enabled, it doeesn't need to call
> > > > uaccess_ttbr0_enable()/disable().
> > >
> > > Regardless of uaccess_ttbr0_enable() and uaccess_ttbr0_disable()
> > > specifically, API-wise unsafe_get_user() is only supposed to be called
> > > between user_access_begin() and user_access_end(), and there's some
> > > stuff we probably want to add there (e.g. might_fault(), which
> > > unsafe_get_user() lacks today).
> > >
> > > Do we call those?
> >
> > Yes when you're available.
> > As you mention, the difference seems might_fault(),
> > But I'm not sure whether that would be a reason to validate to use
> > get_user() instead of unsafe_get_user() taking a increase of instruction
> > of "nop" -- uaccess_ttbr0_enable()/disable() in LSUI
> > except the reason for DEUBG purpose.
>
> I think the practical impact of those NOPs is going to be neglible, and
> not worth optimizing for unless/until we have data demonstrating
> otherwise.
>
> If we want to strictly avoid those NOPs, I think that we should do a
> more general cleanup, and e.g. have variants of user_access_begin() and
> user_access_end() that do not mess with TTBR0. I don't think we need to
> do that for this series.
>
> For now, I think that you should either:
>
> * Use get_user().
>
> * Use user_access_begin() .. user_access_end() wrapping both
>   unsafe_get_user() and the user cmpxchg.

Understood. I'll use get_user() in this series after
getting some more comments from others about the C version which I sent
before your suggestion (sorry for missing your email before I sent it).

Thanks :D

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation
  2025-09-16 12:50                 ` Yeoreum Yun
@ 2025-09-17  9:32                   ` Yeoreum Yun
  0 siblings, 0 replies; 51+ messages in thread
From: Yeoreum Yun @ 2025-09-17  9:32 UTC (permalink / raw)
  To: Will Deacon
  Cc: Catalin Marinas, broonie, maz, oliver.upton, joey.gouly,
	james.morse, ardb, scott, suzuki.poulose, yuzenghui, mark.rutland,
	linux-arm-kernel, kvmarm, linux-kernel

Hi all,

> Hi,
>
> [...]
> > > +#define LSUI_CMPXCHG_HELPER(suffix, start_bit)                                 \
> > > +static __always_inline int                                                     \
> > > +__lsui_cmpxchg_helper_##suffix(u64 __user *uaddr, u32 oldval, u32 newval)      \
> > > +{                                                                              \
> > > +       int ret = 0;                                                            \
> > > +       u64 oval, nval, tmp;                                                    \
> > > +                                                                               \
> > > +       asm volatile("//__lsui_cmpxchg_helper_" #suffix "\n"                    \
> > > +       __LSUI_PREAMBLE                                                         \
> > > +"      prfm    pstl1strm, %2\n"                                                \
> > > +"1:    ldtr    %x1, %2\n"                                                      \
> > > +"      mov     %x3, %x1\n"                                                     \
> > > +"      bfi     %x1, %x5, #" #start_bit ", #32\n"                               \
> > > +"      bfi     %x3, %x6, #" #start_bit ", #32\n"                               \
> > > +"      mov     %x4, %x1\n"                                                     \
> > > +"2:    caslt   %x1, %x3, %2\n"                                                 \
> > > +"      sub     %x1, %x1, %x4\n"                                                \
> > > +"      cbz     %x1, 3f\n"                                                      \
> > > +"      mov     %w0, %w7\n"                                                     \
> > > +"3:\n"                                                                         \
> > > +"      dmb     ish\n"                                                          \
> > > +"4:\n"                                                                         \
> > > +       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)                                   \
> > > +       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)                                   \
> > > +       : "+r" (ret), "=&r" (oval), "+Q" (*uaddr), "=&r" (nval), "=&r" (tmp)    \
> > > +       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)                            \
> > > +       : "memory");                                                            \
> >
> > The vast majority of this can be written in C.
>
> Here is the version with C base on patch 6:
>
> diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> diff --git a/arch/arm64/include/asm/futex.h b/arch/arm64/include/asm/futex.h
> index 1d6d9f856ac5..68af15ba545a 100644
> --- a/arch/arm64/include/asm/futex.h
> +++ b/arch/arm64/include/asm/futex.h
> @@ -127,81 +127,77 @@ LSUI_FUTEX_ATOMIC_OP(andnot, ldtclr, al)
>  LSUI_FUTEX_ATOMIC_OP(set, swpt, al)
>
>  static __always_inline int
> -__lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
> +__lsui_cmpxchg_helper(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
>  {
> -       return __lsui_futex_atomic_andnot(~oparg, uaddr, oval);
> -}
> +       int ret = -EAGAIN;
> +       u64 __user *uaddr_al;
> +       u64 oval64, nval64, tmp;
> +       static const u64 lo32_mask = GENMASK_U64(31, 0);
> +
> +       uaddr_al = (u64 __user *) ALIGN_DOWN((unsigned long)uaddr, sizeof(u64));
> +       unsafe_get_user(oval64, uaddr_al, err_fault);
> +
> +       if ((u32 __user *)uaddr_al != uaddr) {
> +               nval64 = ((oval64 & lo32_mask) | ((u64)newval << 32));
> +               oval64 = ((oval64 & lo32_mask) | ((u64)oldval << 32));
> +       } else {
> +               nval64 = ((oval64 & ~lo32_mask) | newval);
> +               oval64 = ((oval64 & ~lo32_mask) | oldval);
> +       }
>
> -static __always_inline int
> -__lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
> -{
> -       unsigned int loops = FUTEX_MAX_LOOPS;
> -       int ret, oldval, tmp;
> +       tmp = oval64;
>
> -       /*
> -        * there are no ldteor/stteor instructions...
> -        */
> -       asm volatile("// __lsui_futex_atomic_eor\n"
> +       asm volatile("//__lsui_cmpxchg_helper\n"
>         __LSUI_PREAMBLE
> -"      prfm    pstl1strm, %2\n"
> -"1:    ldtxr   %w1, %2\n"
> -"      eor     %w3, %w1, %w5\n"
> -"2:    stltxr  %w0, %w3, %2\n"
> -"      cbz     %w0, 3f\n"
> -"      sub     %w4, %w4, %w0\n"
> -"      cbnz    %w4, 1b\n"
> -"      mov     %w0, %w6\n"
> -"3:\n"
> +"1:    caslt   %x1, %x3, %2\n"
> +"      sub     %x1, %x1, %x4\n"
> +"      cbnz    %x1, 2f\n"
> +"      mov     %w0, %w5\n"
> +"2:\n"
>  "      dmb     ish\n"
> +"3:\n"
>         _ASM_EXTABLE_UACCESS_ERR(1b, 3b, %w0)
> -       _ASM_EXTABLE_UACCESS_ERR(2b, 3b, %w0)
> -       : "=&r" (ret), "=&r" (oldval), "+Q" (*uaddr), "=&r" (tmp),
> -         "+r" (loops)
> -       : "r" (oparg), "Ir" (-EAGAIN)
> +       : "+r" (ret), "=&r" (oval64), "+Q" (*uaddr_al)
> +       : "r" (nval64), "r" (tmp), "Ir" (0)
>         : "memory");
>
>         if (!ret)
>                 *oval = oldval;
>
> +err_fault:
>         return ret;
>  }
>
>  static __always_inline int
> -__lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> +__lsui_futex_atomic_and(int oparg, u32 __user *uaddr, int *oval)
>  {
> -       int ret = 0;
> -       unsigned int loops = FUTEX_MAX_LOOPS;
> -       u32 val, tmp;
> +       return __lsui_futex_atomic_andnot(~oparg, uaddr, oval);
> +}
> +
> +static __always_inline int
> +__lsui_futex_atomic_eor(int oparg, u32 __user *uaddr, int *oval)
> +{
> +{
> +       int ret = -EAGAIN;
> +       u32 oldval, newval;
>
>         /*
> -        * cas{al}t doesn't support word size...
> +        * there are no ldteor/stteor instructions...
>          */
> -       asm volatile("//__lsui_futex_cmpxchg\n"
> -       __LSUI_PREAMBLE
> -"      prfm    pstl1strm, %2\n"
> -"1:    ldtxr   %w1, %2\n"
> -"      eor     %w3, %w1, %w5\n"
> -"      cbnz    %w3, 4f\n"
> -"2:    stltxr  %w3, %w6, %2\n"
> -"      cbz     %w3, 3f\n"
> -"      sub     %w4, %w4, %w3\n"
> -"      cbnz    %w4, 1b\n"
> -"      mov     %w0, %w7\n"
> -"3:\n"
> -"      dmb     ish\n"
> -"4:\n"
> -       _ASM_EXTABLE_UACCESS_ERR(1b, 4b, %w0)
> -       _ASM_EXTABLE_UACCESS_ERR(2b, 4b, %w0)
> -       : "+r" (ret), "=&r" (val), "+Q" (*uaddr), "=&r" (tmp), "+r" (loops)
> -       : "r" (oldval), "r" (newval), "Ir" (-EAGAIN)
> -       : "memory");
> +       unsafe_get_user(oldval, uaddr, err_fault);
> +       newval = oldval ^ oparg;
>
> -       if (!ret)
> -               *oval = oldval;
> +       ret = __lsui_cmpxchg_helper(uaddr, oldval, newval, oval);
>
> +err_fault:
>         return ret;
>  }
>
> +static __always_inline int
> +__lsui_futex_cmpxchg(u32 __user *uaddr, u32 oldval, u32 newval, u32 *oval)
> +{
> +       return __lsui_cmpxchg_helper(uaddr, oldval, newval, oval);
> +}
> +
>  #define __lsui_llsc_body(op, ...)                                      \
>  ({                                                                     \
>         alternative_has_cap_likely(ARM64_HAS_LSUI) ?                    \
> (END)
>
> I'm not sure this is good for you.
> But If you share your thought, That's would be greatful.
> (Note:
>   When I test with 256 threads for futex_atomic_eor op, there is not much
>   difference with former assembly version)
>

It looks like the discussion is converging on doing the cmpxchg in C.
I'll respin the cmpxchg as a C version, with the missing endianness support added.

Thanks.

--
Sincerely,
Yeoreum Yun


^ permalink raw reply	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2025-09-17  9:33 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-08-16 15:19 [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops Yeoreum Yun
2025-08-16 15:19 ` [PATCH RESEND v7 1/6] arm64: cpufeature: add FEAT_LSUI Yeoreum Yun
2025-09-12 16:12   ` Catalin Marinas
2025-08-16 15:19 ` [PATCH RESEND v7 2/6] KVM: arm64: expose FEAT_LSUI to guest Yeoreum Yun
2025-09-12 16:25   ` Catalin Marinas
2025-08-16 15:19 ` [PATCH RESEND v7 3/6] arm64: Kconfig: add LSUI Kconfig Yeoreum Yun
2025-09-12 16:24   ` Catalin Marinas
2025-09-15 10:42     ` Yeoreum Yun
2025-09-15 11:32       ` Will Deacon
2025-09-15 11:41         ` Yeoreum Yun
2025-08-16 15:19 ` [PATCH RESEND v7 4/6] arm64: futex: refactor futex atomic operation Yeoreum Yun
2025-09-11 15:38   ` Will Deacon
2025-09-11 16:04     ` Yeoreum Yun
2025-09-12 16:44   ` Catalin Marinas
2025-09-12 17:01     ` Catalin Marinas
2025-09-15 10:39     ` Yeoreum Yun
2025-09-12 16:53   ` Catalin Marinas
2025-09-15 10:32     ` Yeoreum Yun
2025-09-15 19:40       ` Catalin Marinas
2025-09-15 20:35         ` Will Deacon
2025-09-16  7:02           ` Catalin Marinas
2025-09-16  9:15             ` Yeoreum Yun
2025-09-16  9:24               ` Yeoreum Yun
2025-09-16 10:02             ` Yeoreum Yun
2025-09-16 10:16               ` Will Deacon
2025-09-16 12:50                 ` Yeoreum Yun
2025-09-17  9:32                   ` Yeoreum Yun
2025-09-16 12:47               ` Mark Rutland
2025-09-16 13:27                 ` Yeoreum Yun
2025-09-16 13:45                   ` Mark Rutland
2025-09-16 13:58                     ` Yeoreum Yun
2025-09-16 14:07                       ` Mark Rutland
2025-09-16 14:15                         ` Yeoreum Yun
2025-09-15 22:34         ` Yeoreum Yun
2025-09-16 12:53           ` Catalin Marinas
2025-08-16 15:19 ` [PATCH v7 RESEND 5/6] arm64: futex: small optimisation for __llsc_futex_atomic_set() Yeoreum Yun
2025-09-11 15:28   ` Will Deacon
2025-09-11 16:19     ` Yeoreum Yun
2025-09-12 16:36       ` Catalin Marinas
2025-09-15 10:41         ` Yeoreum Yun
2025-08-16 15:19 ` [PATCH RESEND v7 6/6] arm64: futex: support futex with FEAT_LSUI Yeoreum Yun
2025-09-11 15:22   ` Will Deacon
2025-09-11 16:45     ` Yeoreum Yun
2025-09-12 17:16     ` Catalin Marinas
2025-09-15  9:15       ` Yeoreum Yun
2025-09-12 17:09   ` Catalin Marinas
2025-09-15  8:24     ` Yeoreum Yun
2025-09-01 10:06 ` [PATCH RESEND v7 0/6] support FEAT_LSUI and apply it on futex atomic ops Yeoreum Yun
2025-09-11 15:09 ` Will Deacon
2025-09-11 16:22   ` Catalin Marinas
2025-09-15 20:37     ` Will Deacon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).