[PATCH v3 0/9] s390: Improve this_cpu operations

Linux s390 Architecture development

* [PATCH v3 0/9] s390: Improve this_cpu operations
@ 2026-05-20  9:22 Heiko Carstens
  2026-05-20  9:22 ` [PATCH v3 1/9] s390/alternatives: Add new ALT_TYPE_PERCPU type Heiko Carstens
                   ` (9 more replies)
  0 siblings, 10 replies; 37+ messages in thread
From: Heiko Carstens @ 2026-05-20  9:22 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, Yang Shi, Shrikanth Hegde, linux-kernel,
	linux-s390

v3:
- Fix various typos [Juergen Christ]

- Add missing kprobe detection / handling [Sashiko [3]]
  [FWIW, this also made me aware that the current general s390 kprobes
   code seems to be racy against concurrent removal of a kprobe while a
   probe is hit on a different CPU. But that is a different story.]

- Fix various minor findings [Sashiko [3]]

- All of this might be dropped / exchanged in future in favor of the percpu
  page table approach proposed by Yang Shi [4].

[3] https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca@linux.ibm.com
[4] https://lore.kernel.org/all/20260429170758.3018959-1-yang@os.amperecomputing.com/

v2:

- Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
  warnings

- Add missing __packed attribute to insn structure [Sashiko [2]]

- Fix inverted if condition [Sashiko [2]]

- Add missing user_mode() check [Sashiko [2]]

- Move percpu_entry() call in front of irqentry_enter() call in all
  entry paths, so that potential this_cpu() operations cannot overwrite
  the not yet saved percpu code section indicator [Sashiko [2]]

[2] https://sashiko.dev/#/patchset/20260317195436.2276810-1-hca%40linux.ibm.com

v1:

This is a follow-up to Peter Zijlstra's in-kernel rseq RFC [1].

With the intended removal of PREEMPT_NONE, this_cpu operations based on
atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
become more expensive: the preempt_disable() / preempt_enable() pairs are
no longer optimized away at compile time.

In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.

To avoid this Peter suggested an in-kernel rseq approach. While this would
certainly work, this series tries to come up with a solution which uses
fewer instructions and does not require repeating instruction sequences.

The idea is that this_cpu operations based on atomic instructions are
guarded with mviy instructions:

- The first mviy instruction writes the number of the register which
  contains the percpu variable address to lowcore. This also indicates
  that a percpu code section is being executed.

- The first instruction following the mviy instruction must be the ag
  instruction which adds the percpu offset to the percpu address register.

- Afterwards the atomic percpu operation follows.

- Then a second mviy instruction writes a zero to lowcore, which indicates
  the end of the percpu code section.

- In case of an interrupt/exception/nmi the register number which was
  written to lowcore is copied to the exception frame (pt_regs), and a zero
  is written to lowcore.

- On return to the previous context it is checked whether a percpu code
  section was being executed (saved register number not zero), and whether
  the process was migrated to a different cpu. If the percpu offset was
  already added to the percpu address register (instruction address does
  _not_ point to the ag instruction) the content of the percpu address
  register is adjusted so it points to the percpu variable of the new cpu.

All of this seems to work, but of course it could still be broken in case I
missed some detail.

In total this series results in a kernel text size reduction of ~106kb. The
number of preempt_schedule_notrace() call sites is reduced from 7089 to
1577.

Note: this comes without any extensive performance analysis, however all
microbenchmarks confirmed that the new code is at least as fast as the
old code, as expected.

[1] 20260223163843.GR1282955@noisy.programming.kicks-ass.net

Heiko Carstens (9):
  s390/alternatives: Add new ALT_TYPE_PERCPU type
  s390/percpu: Infrastructure for more efficient this_cpu operations
  s390/percpu: Add missing do { } while (0) constructs
  s390/percpu: Use new percpu code section for arch_this_cpu_add()
  s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
  s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
  s390/percpu: Provide arch_this_cpu_read() implementation
  s390/percpu: Provide arch_this_cpu_write() implementation
  s390/percpu: Remove one and two byte this_cpu operation implementation

 arch/s390/boot/alternative.c         |   7 +
 arch/s390/include/asm/alternative.h  |   5 +
 arch/s390/include/asm/entry-percpu.h |  76 ++++++++
 arch/s390/include/asm/lowcore.h      |   3 +-
 arch/s390/include/asm/percpu.h       | 249 +++++++++++++++++++++------
 arch/s390/include/asm/ptrace.h       |   2 +
 arch/s390/kernel/alternative.c       |  25 ++-
 arch/s390/kernel/irq.c               |  26 ++-
 arch/s390/kernel/nmi.c               |   6 +
 arch/s390/kernel/traps.c             |   6 +
 10 files changed, 344 insertions(+), 61 deletions(-)
 create mode 100644 arch/s390/include/asm/entry-percpu.h

base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
-- 
2.51.0



* [PATCH v3 1/9] s390/alternatives: Add new ALT_TYPE_PERCPU type
  2026-05-20  9:22 [PATCH v3 0/9] s390: Improve this_cpu operations Heiko Carstens
@ 2026-05-20  9:22 ` Heiko Carstens
  2026-05-20 12:43   ` David Laight
  2026-05-20  9:22 ` [PATCH v3 2/9] s390/percpu: Infrastructure for more efficient this_cpu operations Heiko Carstens
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 37+ messages in thread
From: Heiko Carstens @ 2026-05-20  9:22 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, Yang Shi, Shrikanth Hegde, linux-kernel,
	linux-s390

The upcoming percpu section code uses two mviy instructions to guard the
beginning and end of a percpu code section.

The first mviy instruction writes the number of the register which contains
the percpu address to lowcore. This indicates both the beginning of a percpu
code section and which register contains the percpu address.

At compile time the mviy instruction is generated in a way that its base
register field contains the percpu register, and the immediate field is
zero. This needs to be patched so that the base register is zero, and the
immediate field contains the register number. For example:

  101424:       eb 00 23 c0 00 52       mviy    960(%r2),0

needs to be patched to

  101424:       eb 02 03 c0 00 52       mviy    960(%r0),2

Provide a new ALT_TYPE_PERCPU alternative type which handles this specific
instruction patching. In addition it also handles the relocated lowcore
case, where the displacement of the mviy instruction has a different value.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/boot/alternative.c        |  7 +++++++
 arch/s390/include/asm/alternative.h |  5 +++++
 arch/s390/kernel/alternative.c      | 25 +++++++++++++++++++++++--
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/arch/s390/boot/alternative.c b/arch/s390/boot/alternative.c
index 19ea7934b918..ad078a2b1192 100644
--- a/arch/s390/boot/alternative.c
+++ b/arch/s390/boot/alternative.c
@@ -22,6 +22,9 @@ static void alt_debug_all(int type)
 	case ALT_TYPE_SPEC:
 		alt_debug.spec = 1;
 		break;
+	case ALT_TYPE_PERCPU:
+		alt_debug.percpu = 1;
+		break;
 	}
 }
 
@@ -115,6 +118,7 @@ void alt_debug_setup(char *str)
 		alt_debug_all(ALT_TYPE_FACILITY);
 		alt_debug_all(ALT_TYPE_FEATURE);
 		alt_debug_all(ALT_TYPE_SPEC);
+		alt_debug_all(ALT_TYPE_PERCPU);
 		return;
 	}
 	while (*str) {
@@ -130,6 +134,9 @@ void alt_debug_setup(char *str)
 		case ALT_TYPE_SPEC:
 			alt_debug_all(ALT_TYPE_SPEC);
 			break;
+		case ALT_TYPE_PERCPU:
+			alt_debug_all(ALT_TYPE_PERCPU);
+			break;
 		}
 		if (*str != ';')
 			break;
diff --git a/arch/s390/include/asm/alternative.h b/arch/s390/include/asm/alternative.h
index 1c56480def9e..9ca2e49338a2 100644
--- a/arch/s390/include/asm/alternative.h
+++ b/arch/s390/include/asm/alternative.h
@@ -34,6 +34,7 @@
 #define ALT_TYPE_FACILITY	0
 #define ALT_TYPE_FEATURE	1
 #define ALT_TYPE_SPEC		2
+#define ALT_TYPE_PERCPU		3
 
 #define ALT_DATA_SHIFT		0
 #define ALT_TYPE_SHIFT		20
@@ -51,6 +52,10 @@
 					 ALT_TYPE_SPEC << ALT_TYPE_SHIFT	| \
 					 (facility) << ALT_DATA_SHIFT)
 
+#define ALT_PERCPU(num)			(ALT_CTX_EARLY << ALT_CTX_SHIFT		| \
+					 ALT_TYPE_PERCPU << ALT_TYPE_SHIFT	| \
+					 (num) << ALT_DATA_SHIFT)
+
 #ifndef __ASSEMBLER__
 
 #include <linux/types.h>
diff --git a/arch/s390/kernel/alternative.c b/arch/s390/kernel/alternative.c
index 02d04ae621ba..a79a11879c2f 100644
--- a/arch/s390/kernel/alternative.c
+++ b/arch/s390/kernel/alternative.c
@@ -28,6 +28,7 @@ struct alt_debug {
 	unsigned long facilities[MAX_FACILITY_BIT / BITS_PER_LONG];
 	unsigned long mfeatures[MAX_MFEATURE_BIT / BITS_PER_LONG];
 	int spec;
+	int percpu;
 };
 
 static struct alt_debug __bootdata_preserved(alt_debug);
@@ -48,8 +49,18 @@ static void alternative_dump(u8 *old, u8 *new, unsigned int len, unsigned int ty
 	a_debug("[%d/%3d] %016lx: %s -> %s\n", type, data, kptr, oinsn, ninsn);
 }
 
+struct insn_siy {
+	u64	opc1 : 8;
+	u64	i2   : 8;
+	u64	b1   : 4;
+	u64	dl1  : 12;
+	u64	dh1  : 8;
+	u64	opc2 : 8;
+} __packed;
+
 void __apply_alternatives(struct alt_instr *start, struct alt_instr *end, unsigned int ctx)
 {
+	struct insn_siy insn_siy;
 	struct alt_debug *d;
 	struct alt_instr *a;
 	bool debug, replace;
@@ -63,6 +74,8 @@ void __apply_alternatives(struct alt_instr *start, struct alt_instr *end, unsign
 	for (a = start; a < end; a++) {
 		if (!(a->ctx & ctx))
 			continue;
+		old = (u8 *)&a->instr_offset + a->instr_offset;
+		new = (u8 *)&a->repl_offset + a->repl_offset;
 		switch (a->type) {
 		case ALT_TYPE_FACILITY:
 			replace = test_facility(a->data);
@@ -76,14 +89,22 @@ void __apply_alternatives(struct alt_instr *start, struct alt_instr *end, unsign
 			replace = nobp_enabled();
 			debug = d->spec;
 			break;
+		case ALT_TYPE_PERCPU:
+			replace = true;
+			insn_siy = *(struct insn_siy *)old;
+			if (test_machine_feature(MFEATURE_LOWCORE))
+				insn_siy = *(struct insn_siy *)new;
+			insn_siy.i2 = insn_siy.b1;
+			insn_siy.b1 = 0;
+			new = (u8 *)&insn_siy;
+			debug = d->percpu;
+			break;
 		default:
 			replace = false;
 			debug = false;
 		}
 		if (!replace)
 			continue;
-		old = (u8 *)&a->instr_offset + a->instr_offset;
-		new = (u8 *)&a->repl_offset + a->repl_offset;
 		if (debug)
 			alternative_dump(old, new, a->instrlen, a->type, a->data);
 		s390_kernel_write(old, new, a->instrlen);
-- 
2.51.0



* Re: [PATCH v3 1/9] s390/alternatives: Add new ALT_TYPE_PERCPU type
  2026-05-20  9:22 ` [PATCH v3 1/9] s390/alternatives: Add new ALT_TYPE_PERCPU type Heiko Carstens
@ 2026-05-20 12:43   ` David Laight
  2026-05-20 13:50     ` Heiko Carstens
  0 siblings, 1 reply; 37+ messages in thread
From: David Laight @ 2026-05-20 12:43 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Peter Zijlstra, Yang Shi,
	Shrikanth Hegde, linux-kernel, linux-s390

On Wed, 20 May 2026 11:22:35 +0200
Heiko Carstens <hca@linux.ibm.com> wrote:

> The upcoming percpu section code uses two mviy instructions to guard the
> beginning and end of a percpu code section.
> 
> The first mviy instruction writes the register number, which contains the
> percpu address to lowcore. This indicates both the beginning of a percpu
> code section and which register contains the percpu address.
> 
> During compile time the mviy instruction is generated in a way that its
> base register contains the percpu register, and the immediate field is
> zero. This needs to be patched so that the base register is zero, and the
> immediate field contains the register number. For example
> 
>   101424:       eb 00 23 c0 00 52       mviy    960(%r2),0
> 
> needs to be patched to
> 
>   101424:       eb 02 03 c0 00 52       mviy    960(%r0),2

I'm sure it is possible to get the preprocessor to extract the register
number for you.
The exception table logic almost certainly already does it.
(The x86 version certainly does - and that is far less trivial.)

-- David


> 
> Provide a new ALT_TYPE_PERCPU alternative type which handles this specific
> instruction patching. In addition it also handles the relocated lowcore
> case, where the displacement of the mviy instruction has a different value.


* Re: [PATCH v3 1/9] s390/alternatives: Add new ALT_TYPE_PERCPU type
  2026-05-20 12:43   ` David Laight
@ 2026-05-20 13:50     ` Heiko Carstens
  2026-05-20 14:16       ` Heiko Carstens
  0 siblings, 1 reply; 37+ messages in thread
From: Heiko Carstens @ 2026-05-20 13:50 UTC (permalink / raw)
  To: David Laight
  Cc: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Peter Zijlstra, Yang Shi,
	Shrikanth Hegde, linux-kernel, linux-s390

On Wed, May 20, 2026 at 01:43:17PM +0100, David Laight wrote:
> On Wed, 20 May 2026 11:22:35 +0200
> Heiko Carstens <hca@linux.ibm.com> wrote:
> 
> > The upcoming percpu section code uses two mviy instructions to guard the
> > beginning and end of a percpu code section.
> > 
> > The first mviy instruction writes the register number, which contains the
> > percpu address to lowcore. This indicates both the beginning of a percpu
> > code section and which register contains the percpu address.
> > 
> > During compile time the mviy instruction is generated in a way that its
> > base register contains the percpu register, and the immediate field is
> > zero. This needs to be patched so that the base register is zero, and the
> > immediate field contains the register number. For example
> > 
> >   101424:       eb 00 23 c0 00 52       mviy    960(%r2),0
> > 
> > needs to be patched to
> > 
> >   101424:       eb 02 03 c0 00 52       mviy    960(%r0),2
> 
> I'm sure it is possible to get the preprocessor to extract the register
> number for you.
> The exception table logic almost certainly already does it.
> (The x86 version certainly does - and that is far less trivial.)

That's true, the s390 extable logic is doing the same. However, I failed to
feed the extracted register number as a constant into the same inline
assembly it was extracted from.


* Re: [PATCH v3 1/9] s390/alternatives: Add new ALT_TYPE_PERCPU type
  2026-05-20 13:50     ` Heiko Carstens
@ 2026-05-20 14:16       ` Heiko Carstens
  0 siblings, 0 replies; 37+ messages in thread
From: Heiko Carstens @ 2026-05-20 14:16 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: David Laight, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Peter Zijlstra, Yang Shi,
	Shrikanth Hegde, linux-kernel, linux-s390

On Wed, May 20, 2026 at 03:50:45PM +0200, Heiko Carstens wrote:
> On Wed, May 20, 2026 at 01:43:17PM +0100, David Laight wrote:
> > On Wed, 20 May 2026 11:22:35 +0200
> > Heiko Carstens <hca@linux.ibm.com> wrote:
> > I'm sure it is possible to get the preprocessor to extract the register
> > number for you.
> > The exception table logic almost certainly already does it.
> > (The x86 version certainly does - and that is far less trivial.)
> 
> That's true, the s390 extable logic is doing the same. However, I failed to
> feed the extracted register number as a constant into the same inline
> assembly it was extracted from.

Thinking about this again, I guess it should indeed be possible, and
my initial approach was wrong. Let me try again.


* [PATCH v3 2/9] s390/percpu: Infrastructure for more efficient this_cpu operations
  2026-05-20  9:22 [PATCH v3 0/9] s390: Improve this_cpu operations Heiko Carstens
  2026-05-20  9:22 ` [PATCH v3 1/9] s390/alternatives: Add new ALT_TYPE_PERCPU type Heiko Carstens
@ 2026-05-20  9:22 ` Heiko Carstens
  2026-05-20  9:22 ` [PATCH v3 3/9] s390/percpu: Add missing do { } while (0) constructs Heiko Carstens
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Heiko Carstens @ 2026-05-20  9:22 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, Yang Shi, Shrikanth Hegde, linux-kernel,
	linux-s390

With the intended removal of PREEMPT_NONE, this_cpu operations based on
atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
become more expensive: the preempt_disable() / preempt_enable() pairs are
no longer optimized away at compile time.

In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.

E.g. this simple C code sequence

DEFINE_PER_CPU(long, foo);
long bar(long a) { return this_cpu_add_return(foo, a); }

generates this code:

  11a976:       eb af f0 68 00 24       stmg    %r10,%r15,104(%r15)
  11a97c:       b9 04 00 ef             lgr     %r14,%r15
  11a980:       b9 04 00 b2             lgr     %r11,%r2
  11a984:       e3 f0 ff c8 ff 71       lay     %r15,-56(%r15)
  11a98a:       e3 e0 f0 98 00 24       stg     %r14,152(%r15)
  11a990:       eb 01 03 a8 00 6a       asi     936,1            <- __preempt_count_add(1)
  11a996:       c0 10 00 d2 ac b5       larl    %r1,1b70300      <- address of percpu var
  11a9a0:       e3 10 23 b8 00 08       ag      %r1,952          <- add percpu offset
  11a9a6:       eb ab 10 00 00 e8       laag    %r10,%r11,0(%r1) <- atomic op
  11a9ac:       eb ff 03 a8 00 6e       alsi    936,-1           <- __preempt_count_dec_and_test()
  11a9b2:       a7 54 00 05             jnhe    11a9bc <bar+0x4c>
  11a9b6:       c0 e5 00 76 d1 bd       brasl   %r14,ff4d30 <preempt_schedule_notrace>
  11a9bc:       b9 e8 b0 2a             agrk    %r2,%r10,%r11
  11a9c0:       eb af f0 a0 00 04       lmg     %r10,%r15,160(%r15)
  11a9c6:       07 fe                   br      %r14

The above example is more or less the worst case, since the call to
preempt_schedule_notrace() requires a stack frame which otherwise would
not be necessary. But even without the stack frame there is still the
conditional jnhe branch instruction.

Get rid of the conditional branch with the following code sequence:

  11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
  11a8ec:       b9 04 00 43             lgr     %r4,%r3
  11a8f0:       eb 00 43 c0 00 52       mviy    960,4
  11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
  11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
  11a902:       eb 00 03 c0 00 52       mviy    960,0
  11a908:       b9 08 00 25             agr     %r2,%r5
  11a90c:       07 fe                   br      %r14

The general idea is that this_cpu operations based on atomic instructions
are guarded with mviy instructions:

- The first mviy instruction writes the number of the register which
  contains the percpu variable address to lowcore. This also indicates
  that a percpu code section is being executed.

- The first instruction following the mviy instruction must be the ag
  instruction which adds the percpu offset to the percpu address register.

- Afterwards the atomic percpu operation follows.

- Then a second mviy instruction writes a zero to lowcore, which indicates
  the end of the percpu code section.

- In case of an interrupt/exception/nmi the register number which was
  written to lowcore is copied to the exception frame (pt_regs), and a zero
  is written to lowcore.

- On return to the previous context it is checked whether a percpu code
  section was being executed (saved register number not zero), and whether
  the process was migrated to a different cpu. If the percpu offset was
  already added to the percpu address register (instruction address does
  _not_ point to the ag instruction) the content of the percpu address
  register is adjusted so it points to the percpu variable of the new cpu.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/entry-percpu.h | 76 ++++++++++++++++++++++++++++
 arch/s390/include/asm/lowcore.h      |  3 +-
 arch/s390/include/asm/percpu.h       | 54 ++++++++++++++++++++
 arch/s390/include/asm/ptrace.h       |  2 +
 arch/s390/kernel/irq.c               | 26 +++++++---
 arch/s390/kernel/nmi.c               |  6 +++
 arch/s390/kernel/traps.c             |  6 +++
 7 files changed, 165 insertions(+), 8 deletions(-)
 create mode 100644 arch/s390/include/asm/entry-percpu.h

diff --git a/arch/s390/include/asm/entry-percpu.h b/arch/s390/include/asm/entry-percpu.h
new file mode 100644
index 000000000000..e25108e773ab
--- /dev/null
+++ b/arch/s390/include/asm/entry-percpu.h
@@ -0,0 +1,76 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ARCH_S390_ENTRY_PERCPU_H
+#define ARCH_S390_ENTRY_PERCPU_H
+
+#include <linux/kprobes.h>
+#include <linux/percpu.h>
+#include <asm/lowcore.h>
+#include <asm/ptrace.h>
+#include <asm/asm-offsets.h>
+
+static __always_inline void percpu_entry(struct pt_regs *regs)
+{
+	struct lowcore *lc = get_lowcore();
+
+	if (user_mode(regs))
+		return;
+	regs->cpu = lc->cpu_nr;
+	regs->percpu_register = lc->percpu_register;
+	lc->percpu_register = 0;
+}
+
+static __always_inline bool percpu_code_check(struct pt_regs *regs)
+{
+	unsigned int insn, disp;
+	struct kprobe *p;
+
+	if (user_mode(regs) || !regs->percpu_register)
+		return false;
+	/*
+	 * Within a percpu code section - check if the percpu base register
+	 * needs to be updated. This is the case if the PSW does not point to
+	 * the ADD instruction within the section.
+	 * - AG %rx,percpu_offset_in_lowcore(%r0,%r0)
+	 * which adds the percpu offset to the percpu base register.
+	 */
+	lockdep_assert_preemption_disabled();
+again:
+	insn = READ_ONCE(*(u16 *)psw_bits(regs->psw).ia);
+	if (unlikely(insn == BREAKPOINT_INSTRUCTION)) {
+		p = get_kprobe((void *)psw_bits(regs->psw).ia);
+		/*
+		 * If the kprobe is concurrently removed on a different CPU
+		 * it might not be found anymore. However text must have
+		 * been restored - try again.
+		 */
+		if (!p)
+			goto again;
+		insn = p->opcode;
+	}
+	if ((insn & 0xff0f) != 0xe300)
+		return false;
+	disp = offsetof(struct lowcore, percpu_offset);
+	if (machine_has_relocated_lowcore())
+		disp += LOWCORE_ALT_ADDRESS;
+	insn = (disp & 0xff000) >> 4 | (disp & 0x00fff) << 16 | 0x8;
+	if (*(u32 *)(psw_bits(regs->psw).ia + 2) != insn)
+		return false;
+	return true;
+}
+
+static __always_inline void percpu_fixup(struct pt_regs *regs)
+{
+	struct lowcore *lc = get_lowcore();
+	unsigned char reg;
+
+	reg = regs->percpu_register;
+	lc->percpu_register = reg;
+	/* Check if process has been migrated to a different CPU. */
+	if (regs->cpu == lc->cpu_nr)
+		return;
+	/* Fixup percpu base register */
+	regs->gprs[reg] -= __per_cpu_offset[regs->cpu];
+	regs->gprs[reg] += lc->percpu_offset;
+}
+
+#endif
diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
index 50ffe75adeb4..cd1ddfdb5d35 100644
--- a/arch/s390/include/asm/lowcore.h
+++ b/arch/s390/include/asm/lowcore.h
@@ -165,7 +165,8 @@ struct lowcore {
 	__u32	spinlock_index;			/* 0x03b0 */
 	__u8	pad_0x03b4[0x03b8-0x03b4];	/* 0x03b4 */
 	__u64	percpu_offset;			/* 0x03b8 */
-	__u8	pad_0x03c0[0x0400-0x03c0];	/* 0x03c0 */
+	__u8	percpu_register;		/* 0x03c0 */
+	__u8	pad_0x03c1[0x0400-0x03c1];	/* 0x03c1 */
 
 	__u32	return_lpswe;			/* 0x0400 */
 	__u32	return_mcck_lpswe;		/* 0x0404 */
diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index b18a96f3a334..1af622a8aa67 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -60,6 +60,60 @@
 #define this_cpu_or_1(pcp, val)		arch_this_cpu_to_op_simple(pcp, val, |)
 #define this_cpu_or_2(pcp, val)		arch_this_cpu_to_op_simple(pcp, val, |)
 
+/*
+ * Macros to be used for percpu code section based on atomic instructions.
+ *
+ * Avoid the need to use preempt_disable() / preempt_enable() pairs and the
+ * conditional preempt_schedule_notrace() function calls which come with
+ * this. The idea is that this_cpu operations based on atomic instructions are
+ * guarded with mviy instructions:
+ *
+ * - The first mviy instruction writes the register number, which contains the
+ *   percpu address variable to lowcore. This also indicates that a percpu
+ *   code section is executed.
+ *
+ * - The first instruction following the mviy instruction must be the ag
+ *   instruction which adds the percpu offset to the percpu address register.
+ *
+ * - Afterwards the atomic percpu operation follows.
+ *
+ * - Then a second mviy instruction writes a zero to lowcore, which indicates
+ *   the end of the percpu code section.
+ *
+ * - In case of an interrupt/exception/nmi the register number which was
+ *   written to lowcore is copied to the exception frame (pt_regs), and a zero
+ *   is written to lowcore.
+ *
+ * - On return to the previous context it is checked if a percpu code section
+ *   was executed (saved register number not zero), and if the process was
+ *   migrated to a different cpu. If the percpu offset was already added to
+ *   the percpu address register (instruction address does _not_ point to the
+ *   ag instruction) the content of the percpu address register is adjusted so
+ *   it points to percpu variable of the new cpu.
+ *
+ * Inline assemblies making use of this typically have a code sequence like:
+ *
+ *   MVIY_PERCPU(...) <- start of percpu code section
+ *   AG_ALT(...)      <- add percpu offset; must be the second instruction
+ *   atomic_op	      <- atomic op
+ *   MVIY_ALT(...)    <- end of percpu code section
+ */
+
+#define MVIY_PERCPU(disp, dispalt, base)				\
+	ALTERNATIVE("	mviy	" disp	  "(" base " ),0\n",		\
+		    "	mviy	" dispalt "(" base " ),0\n",		\
+		    ALT_PERCPU(0))
+
+#define MVIY_ALT(disp, dispalt, base)					\
+	ALTERNATIVE("	mviy	" disp	  "(" base " ),0\n",		\
+		    "	mviy	" dispalt "(" base " ),0\n",		\
+		    ALT_FEATURE(MFEATURE_LOWCORE))
+
+#define AG_ALT(disp, dispalt, reg)					\
+	ALTERNATIVE("	ag	" reg ", " disp    "(%%r0)\n",		\
+		    "	ag	" reg ", " dispalt "(%%r0)\n",		\
+		    ALT_FEATURE(MFEATURE_LOWCORE))
+
 #ifndef MARCH_HAS_Z196_FEATURES
 
 #define this_cpu_add_4(pcp, val)	arch_this_cpu_to_op_simple(pcp, val, +)
diff --git a/arch/s390/include/asm/ptrace.h b/arch/s390/include/asm/ptrace.h
index aaceb1d9110a..495e310c3d6d 100644
--- a/arch/s390/include/asm/ptrace.h
+++ b/arch/s390/include/asm/ptrace.h
@@ -134,6 +134,8 @@ struct pt_regs {
 	};
 	unsigned long flags;
 	unsigned long last_break;
+	unsigned int cpu;
+	unsigned char percpu_register;
 };
 
 /*
diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
index d10a17e6531d..92fdc2ae96f8 100644
--- a/arch/s390/kernel/irq.c
+++ b/arch/s390/kernel/irq.c
@@ -33,6 +33,7 @@
 #include <asm/softirq_stack.h>
 #include <asm/vtime.h>
 #include <asm/asm.h>
+#include <asm/entry-percpu.h>
 #include "entry.h"
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_stat, irq_stat);
@@ -142,10 +143,13 @@ static int irq_pending(struct pt_regs *regs)
 
 void noinstr do_io_irq(struct pt_regs *regs)
 {
-	irqentry_state_t state = irqentry_enter(regs);
-	struct pt_regs *old_regs = set_irq_regs(regs);
-	bool from_idle;
+	bool from_idle, percpu_needs_fixup;
+	struct pt_regs *old_regs;
+	irqentry_state_t state;
 
+	percpu_entry(regs);
+	state = irqentry_enter(regs);
+	old_regs = set_irq_regs(regs);
 	from_idle = test_and_clear_cpu_flag(CIF_ENABLED_WAIT);
 	if (from_idle)
 		update_timer_idle();
@@ -170,21 +174,26 @@ void noinstr do_io_irq(struct pt_regs *regs)
 			do_irq_async(regs, IO_INTERRUPT);
 	} while (machine_is_lpar() && irq_pending(regs));
 
+	percpu_needs_fixup = percpu_code_check(regs);
 	irq_exit_rcu();
-
 	set_irq_regs(old_regs);
 	irqentry_exit(regs, state);
 
 	if (from_idle)
 		regs->psw.mask &= ~(PSW_MASK_EXT | PSW_MASK_IO | PSW_MASK_WAIT);
+	if (percpu_needs_fixup)
+		percpu_fixup(regs);
 }
 
 void noinstr do_ext_irq(struct pt_regs *regs)
 {
-	irqentry_state_t state = irqentry_enter(regs);
-	struct pt_regs *old_regs = set_irq_regs(regs);
-	bool from_idle;
+	bool from_idle, percpu_needs_fixup;
+	struct pt_regs *old_regs;
+	irqentry_state_t state;
 
+	percpu_entry(regs);
+	state = irqentry_enter(regs);
+	old_regs = set_irq_regs(regs);
 	from_idle = test_and_clear_cpu_flag(CIF_ENABLED_WAIT);
 	if (from_idle)
 		update_timer_idle();
@@ -206,12 +215,15 @@ void noinstr do_ext_irq(struct pt_regs *regs)
 
 	do_irq_async(regs, EXT_INTERRUPT);
 
+	percpu_needs_fixup = percpu_code_check(regs);
 	irq_exit_rcu();
 	set_irq_regs(old_regs);
 	irqentry_exit(regs, state);
 
 	if (from_idle)
 		regs->psw.mask &= ~(PSW_MASK_EXT | PSW_MASK_IO | PSW_MASK_WAIT);
+	if (percpu_needs_fixup)
+		percpu_fixup(regs);
 }
 
 static void show_msi_interrupt(struct seq_file *p, int irq)
diff --git a/arch/s390/kernel/nmi.c b/arch/s390/kernel/nmi.c
index 94fbfad49f62..d43cc18fe9be 100644
--- a/arch/s390/kernel/nmi.c
+++ b/arch/s390/kernel/nmi.c
@@ -22,6 +22,7 @@
 #include <linux/module.h>
 #include <linux/sched/signal.h>
 #include <linux/kvm_host.h>
+#include <asm/entry-percpu.h>
 #include <asm/lowcore.h>
 #include <asm/ctlreg.h>
 #include <asm/fpu.h>
@@ -363,6 +364,7 @@ NOKPROBE_SYMBOL(s390_backup_mcck_info);
  */
 void notrace s390_do_machine_check(struct pt_regs *regs)
 {
+	bool percpu_needs_fixup;
 	static int ipd_count;
 	static DEFINE_SPINLOCK(ipd_lock);
 	static unsigned long long last_ipd;
@@ -374,6 +376,7 @@ void notrace s390_do_machine_check(struct pt_regs *regs)
 	unsigned long mcck_dam_code;
 	int mcck_pending = 0;
 
+	percpu_entry(regs);
 	irq_state = irqentry_nmi_enter(regs);
 
 	if (user_mode(regs))
@@ -495,7 +498,10 @@ void notrace s390_do_machine_check(struct pt_regs *regs)
 	if (mcck_pending)
 		schedule_mcck_handler();
 
+	percpu_needs_fixup = percpu_code_check(regs);
 	irqentry_nmi_exit(regs, irq_state);
+	if (percpu_needs_fixup)
+		percpu_fixup(regs);
 }
 NOKPROBE_SYMBOL(s390_do_machine_check);
 
diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c
index 1b5c6fc431cc..fb16e9bee80b 100644
--- a/arch/s390/kernel/traps.c
+++ b/arch/s390/kernel/traps.c
@@ -24,6 +24,7 @@
 #include <linux/entry-common.h>
 #include <linux/kmsan.h>
 #include <linux/bug.h>
+#include <asm/entry-percpu.h>
 #include <asm/asm-extable.h>
 #include <asm/irqflags.h>
 #include <asm/ptrace.h>
@@ -329,6 +330,7 @@ static void (*pgm_check_table[128])(struct pt_regs *regs);
 void noinstr __do_pgm_check(struct pt_regs *regs)
 {
 	struct lowcore *lc = get_lowcore();
+	bool percpu_needs_fixup;
 	irqentry_state_t state;
 	unsigned int trapnr;
 	union teid teid;
@@ -349,6 +351,7 @@ void noinstr __do_pgm_check(struct pt_regs *regs)
 		current->thread.gmap_int_code = regs->int_code & 0xffff;
 		return;
 	}
+	percpu_entry(regs);
 	state = irqentry_enter(regs);
 	if (user_mode(regs)) {
 		update_timer_sys();
@@ -385,7 +388,10 @@ void noinstr __do_pgm_check(struct pt_regs *regs)
 		pgm_check_table[trapnr](regs);
 out:
 	local_irq_disable();
+	percpu_needs_fixup = percpu_code_check(regs);
 	irqentry_exit(regs, state);
+	if (percpu_needs_fixup)
+		percpu_fixup(regs);
 }
 
 /*
-- 
2.51.0



* [PATCH v3 3/9] s390/percpu: Add missing do { } while (0) constructs
  2026-05-20  9:22 [PATCH v3 0/9] s390: Improve this_cpu operations Heiko Carstens
  2026-05-20  9:22 ` [PATCH v3 1/9] s390/alternatives: Add new ALT_TYPE_PERCPU type Heiko Carstens
  2026-05-20  9:22 ` [PATCH v3 2/9] s390/percpu: Infrastructure for more efficient this_cpu operations Heiko Carstens
@ 2026-05-20  9:22 ` Heiko Carstens
  2026-05-20  9:22 ` [PATCH v3 4/9] s390/percpu: Use new percpu code section for arch_this_cpu_add() Heiko Carstens
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Heiko Carstens @ 2026-05-20  9:22 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, Yang Shi, Shrikanth Hegde, linux-kernel,
	linux-s390

Add missing do { } while (0) constructs in order to avoid potential
build failures.
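
As a minimal sketch of the kind of build failure meant here (COUNTER_ADD and
counter are hypothetical stand-ins, not the kernel macros): a statement-like
macro with a bare { } body breaks as soon as it is used in an if/else without
braces, while the do { } while (0) form does not.

```c
#include <assert.h>

/* Hypothetical stand-in for a percpu variable; not a kernel API. */
static long counter;

/* Statement-like macro wrapped in do { } while (0). */
#define COUNTER_ADD(val)		\
do {					\
	long val__ = (val);		\
	counter += val__;		\
} while (0)

/*
 * With a bare { ... } body instead, "COUNTER_ADD(x);" would expand to
 * "{ ... };". The stray ';' terminates the if statement, so the
 * following "else" has no matching "if" and the build fails.
 */
static long add_if_positive(long x)
{
	assert(x > -1000 && x < 1000);	/* keep the toy example in range */
	if (x > 0)
		COUNTER_ADD(x);
	else
		COUNTER_ADD(0);
	return counter;
}
```

Calling add_if_positive(5) adds 5 to counter; a non-positive argument
leaves it unchanged.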

Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca%40linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 1af622a8aa67..c8fc8b320a86 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -128,7 +128,7 @@
 #else /* MARCH_HAS_Z196_FEATURES */
 
 #define arch_this_cpu_add(pcp, val, op1, op2, szcast)			\
-{									\
+do {									\
 	typedef typeof(pcp) pcp_op_T__; 				\
 	pcp_op_T__ val__ = (val);					\
 	pcp_op_T__ old__, *ptr__;					\
@@ -149,7 +149,7 @@
 			: "cc");					\
 	}								\
 	preempt_enable_notrace();					\
-}
+} while (0)
 
 #define this_cpu_add_4(pcp, val) arch_this_cpu_add(pcp, val, "laa", "asi", int)
 #define this_cpu_add_8(pcp, val) arch_this_cpu_add(pcp, val, "laag", "agsi", long)
@@ -174,7 +174,7 @@
 #define this_cpu_add_return_8(pcp, val) arch_this_cpu_add_return(pcp, val, "laag")
 
 #define arch_this_cpu_to_op(pcp, val, op)				\
-{									\
+do {									\
 	typedef typeof(pcp) pcp_op_T__; 				\
 	pcp_op_T__ val__ = (val);					\
 	pcp_op_T__ old__, *ptr__;					\
@@ -186,7 +186,7 @@
 		: [val__] "d" (val__)					\
 		: "cc");						\
 	preempt_enable_notrace();					\
-}
+} while (0)
 
 #define this_cpu_and_4(pcp, val)	arch_this_cpu_to_op(pcp, val, "lan")
 #define this_cpu_and_8(pcp, val)	arch_this_cpu_to_op(pcp, val, "lang")
-- 
2.51.0



* [PATCH v3 4/9] s390/percpu: Use new percpu code section for arch_this_cpu_add()
  2026-05-20  9:22 [PATCH v3 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (2 preceding siblings ...)
  2026-05-20  9:22 ` [PATCH v3 3/9] s390/percpu: Add missing do { } while (0) constructs Heiko Carstens
@ 2026-05-20  9:22 ` Heiko Carstens
  2026-05-20  9:22 ` [PATCH v3 5/9] s390/percpu: Use new percpu code section for arch_this_cpu_add_return() Heiko Carstens
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Heiko Carstens @ 2026-05-20  9:22 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, Yang Shi, Shrikanth Hegde, linux-kernel,
	linux-s390

Convert arch_this_cpu_add() to make use of the new percpu code section
infrastructure.

With this the text size of the kernel image is reduced by ~76kb
(defconfig). Also more than 5300 generated preempt_schedule_notrace()
function calls within the kernel image (modules not counted) are removed.

With:

DEFINE_PER_CPU(long, foo);
void bar(long a) { this_cpu_add(foo, a); }

Old arch_this_cpu_add() looks like this:

00000000000000c0 <bar>:
  c0:   c0 04 00 00 00 00       jgnop   c0 <bar>
  c6:   eb 01 03 a8 00 6a       asi     936,1
  cc:   c4 18 00 00 00 00       lgrl    %r1,cc <bar+0xc>
                        ce: R_390_GOTENT        foo+0x2
  d2:   e3 10 03 b8 00 08       ag      %r1,952
  d8:   eb 22 10 00 00 e8       laag    %r2,%r2,0(%r1)
  de:   eb ff 03 a8 00 6e       alsi    936,-1
  e4:   a7 a4 00 05             jhe     ee <bar+0x2e>
  e8:   c0 f4 00 00 00 00       jg      e8 <bar+0x28>
                        ea: R_390_PC32DBL       __s390_indirect_jump_r14+0x2
  ee:   c0 f4 00 00 00 00       jg      ee <bar+0x2e>
                        f0: R_390_PLT32DBL      preempt_schedule_notrace+0x2

New arch_this_cpu_add() looks like this:

00000000000000c0 <bar>:
  c0:   c0 04 00 00 00 00       jgnop   c0 <bar>
  c6:   c4 38 00 00 00 00       lgrl    %r3,c6 <bar+0x6>
                        c8: R_390_GOTENT        foo+0x2
  cc:   b9 04 00 43             lgr     %r4,%r3
  d0:   eb 00 43 c0 00 52       mviy    960(%r0),4
  d6:   e3 40 03 b8 00 08       ag      %r4,952
  dc:   eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
  e2:   eb 00 03 c0 00 52       mviy    960,0
  e8:   c0 f4 00 00 00 00       jg      e8 <bar+0x28>
                        ea: R_390_PC32DBL       __s390_indirect_jump_r14+0x2

Note that the conditional function call is removed.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 65 ++++++++++++++++++++++------------
 1 file changed, 43 insertions(+), 22 deletions(-)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index c8fc8b320a86..459603c7305a 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -127,28 +127,49 @@
 
 #else /* MARCH_HAS_Z196_FEATURES */
 
-#define arch_this_cpu_add(pcp, val, op1, op2, szcast)			\
-do {									\
-	typedef typeof(pcp) pcp_op_T__; 				\
-	pcp_op_T__ val__ = (val);					\
-	pcp_op_T__ old__, *ptr__;					\
-	preempt_disable_notrace();					\
-	ptr__ = raw_cpu_ptr(&(pcp)); 				\
-	if (__builtin_constant_p(val__) &&				\
-	    ((szcast)val__ > -129) && ((szcast)val__ < 128)) {		\
-		asm volatile(						\
-			op2 "   %[ptr__],%[val__]"			\
-			: [ptr__] "+Q" (*ptr__) 			\
-			: [val__] "i" ((szcast)val__)			\
-			: "cc");					\
-	} else {							\
-		asm volatile(						\
-			op1 "   %[old__],%[val__],%[ptr__]"		\
-			: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)	\
-			: [val__] "d" (val__)				\
-			: "cc");					\
-	}								\
-	preempt_enable_notrace();					\
+#define arch_this_cpu_add(pcp, val, op1, op2, szcast)				\
+do {										\
+	unsigned long lc_pcpr, lc_pcpo;						\
+	typedef typeof(pcp) pcp_op_T__;						\
+	pcp_op_T__ val__ = (val);						\
+	pcp_op_T__ old__, *ptr__;						\
+										\
+	lc_pcpr = offsetof(struct lowcore, percpu_register);			\
+	lc_pcpo = offsetof(struct lowcore, percpu_offset);			\
+	ptr__ = PERCPU_PTR(&(pcp));						\
+	if (__builtin_constant_p(val__) &&					\
+	    ((szcast)val__ > -129) && ((szcast)val__ < 128)) {			\
+		asm volatile(							\
+			MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+			AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]")	\
+			op2 "   0(%[ptr__]),%[val__]\n"				\
+			MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]", "%%r0")	\
+			: [ptr__] "+&a" (ptr__), "+m" (*ptr__),			\
+			  "=m" (((struct lowcore *)0)->percpu_register)		\
+			: [val__] "i" ((szcast)val__),				\
+			  [disppcpr] "i" (lc_pcpr),				\
+			  [disppcpo] "i" (lc_pcpo),				\
+			  [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS),	\
+			  [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS),	\
+			  "m" (((struct lowcore *)0)->percpu_offset)		\
+			: "cc");						\
+	} else {								\
+		asm volatile(							\
+			MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+			AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]")	\
+			op1 "   %[old__],%[val__],0(%[ptr__])\n"		\
+			MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]", "%%r0")	\
+			: [old__] "=&d" (old__),				\
+			  [ptr__] "+&a" (ptr__),  "+m" (*ptr__),		\
+			  "=m" (((struct lowcore *)0)->percpu_register)		\
+			: [val__] "d" (val__),					\
+			  [disppcpr] "i" (lc_pcpr),				\
+			  [disppcpo] "i" (lc_pcpo),				\
+			  [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS),	\
+			  [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS),	\
+			  "m" (((struct lowcore *)0)->percpu_offset)		\
+			: "cc");						\
+	}									\
 } while (0)
 
 #define this_cpu_add_4(pcp, val) arch_this_cpu_add(pcp, val, "laa", "asi", int)
-- 
2.51.0



* [PATCH v3 5/9] s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
  2026-05-20  9:22 [PATCH v3 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (3 preceding siblings ...)
  2026-05-20  9:22 ` [PATCH v3 4/9] s390/percpu: Use new percpu code section for arch_this_cpu_add() Heiko Carstens
@ 2026-05-20  9:22 ` Heiko Carstens
  2026-05-20  9:22 ` [PATCH v3 6/9] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]() Heiko Carstens
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Heiko Carstens @ 2026-05-20  9:22 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, Yang Shi, Shrikanth Hegde, linux-kernel,
	linux-s390

Convert arch_this_cpu_add_return() to make use of the new percpu code
section infrastructure.

With this the text size of the kernel image is reduced by ~4k
(defconfig). Also 66 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 459603c7305a..b3da863954c5 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -177,17 +177,29 @@ do {										\
 
 #define arch_this_cpu_add_return(pcp, val, op)				\
 ({									\
+	unsigned long lc_pcpr, lc_pcpo;					\
 	typedef typeof(pcp) pcp_op_T__; 				\
 	pcp_op_T__ val__ = (val);					\
 	pcp_op_T__ old__, *ptr__;					\
-	preempt_disable_notrace();					\
-	ptr__ = raw_cpu_ptr(&(pcp));	 				\
-	asm volatile(							\
-		op "    %[old__],%[val__],%[ptr__]"			\
-		: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)		\
-		: [val__] "d" (val__)					\
+									\
+	lc_pcpr = offsetof(struct lowcore, percpu_register);		\
+	lc_pcpo = offsetof(struct lowcore, percpu_offset);		\
+	ptr__ = PERCPU_PTR(&(pcp));					\
+	asm_inline volatile(						\
+		MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+		AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]")	\
+		op "	%[old__],%[val__],0(%[ptr__])\n"		\
+		MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]", "%%r0")	\
+		: [old__] "=&d" (old__),				\
+		  [ptr__] "+&a" (ptr__), "+m" (*ptr__),			\
+		  "=m" (((struct lowcore *)0)->percpu_register)		\
+		: [val__] "d" (val__),					\
+		  [disppcpr] "i" (lc_pcpr),				\
+		  [disppcpo] "i" (lc_pcpo),				\
+		  [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS),	\
+		  [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS),	\
+		  "m" (((struct lowcore *)0)->percpu_offset)		\
 		: "cc");						\
-	preempt_enable_notrace();						\
 	old__ + val__;							\
 })
 
-- 
2.51.0



* [PATCH v3 6/9] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
  2026-05-20  9:22 [PATCH v3 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (4 preceding siblings ...)
  2026-05-20  9:22 ` [PATCH v3 5/9] s390/percpu: Use new percpu code section for arch_this_cpu_add_return() Heiko Carstens
@ 2026-05-20  9:22 ` Heiko Carstens
  2026-05-20  9:22 ` [PATCH v3 7/9] s390/percpu: Provide arch_this_cpu_read() implementation Heiko Carstens
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Heiko Carstens @ 2026-05-20  9:22 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, Yang Shi, Shrikanth Hegde, linux-kernel,
	linux-s390

Convert arch_this_cpu_[and|or]() to make use of the new percpu code
section infrastructure.

There is no user of this_cpu_and() and only one user of this_cpu_or()
within the kernel. Therefore this conversion has hardly any effect,
and removes only one preempt_schedule_notrace() function call.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index b3da863954c5..fa3dc3de5f8b 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -208,17 +208,29 @@ do {										\
 
 #define arch_this_cpu_to_op(pcp, val, op)				\
 do {									\
+	unsigned long lc_pcpr, lc_pcpo;					\
 	typedef typeof(pcp) pcp_op_T__; 				\
 	pcp_op_T__ val__ = (val);					\
 	pcp_op_T__ old__, *ptr__;					\
-	preempt_disable_notrace();					\
-	ptr__ = raw_cpu_ptr(&(pcp));	 				\
-	asm volatile(							\
-		op "    %[old__],%[val__],%[ptr__]"			\
-		: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)		\
-		: [val__] "d" (val__)					\
+									\
+	lc_pcpr = offsetof(struct lowcore, percpu_register);		\
+	lc_pcpo = offsetof(struct lowcore, percpu_offset);		\
+	ptr__ = PERCPU_PTR(&(pcp));					\
+	asm_inline volatile(						\
+		MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+		AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]")	\
+		op "    %[old__],%[val__],0(%[ptr__])\n"		\
+		MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]", "%%r0")	\
+		: [old__] "=&d" (old__),				\
+		  [ptr__] "+&a" (ptr__), "+m" (*ptr__),			\
+		  "=m" (((struct lowcore *)0)->percpu_register)		\
+		: [val__] "d" (val__),					\
+		  [disppcpr] "i" (lc_pcpr),				\
+		  [disppcpo] "i" (lc_pcpo),				\
+		  [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS),	\
+		  [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS),	\
+		  "m" (((struct lowcore *)0)->percpu_offset)		\
 		: "cc");						\
-	preempt_enable_notrace();					\
 } while (0)
 
 #define this_cpu_and_4(pcp, val)	arch_this_cpu_to_op(pcp, val, "lan")
-- 
2.51.0



* [PATCH v3 7/9] s390/percpu: Provide arch_this_cpu_read() implementation
  2026-05-20  9:22 [PATCH v3 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (5 preceding siblings ...)
  2026-05-20  9:22 ` [PATCH v3 6/9] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]() Heiko Carstens
@ 2026-05-20  9:22 ` Heiko Carstens
  2026-05-20  9:22 ` [PATCH v3 8/9] s390/percpu: Provide arch_this_cpu_write() implementation Heiko Carstens
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 37+ messages in thread
From: Heiko Carstens @ 2026-05-20  9:22 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, Yang Shi, Shrikanth Hegde, linux-kernel,
	linux-s390

Provide an s390 specific implementation of arch_this_cpu_read() instead
of the generic variant. The generic variant uses a preempt_disable() /
preempt_enable() pair and READ_ONCE().

Get rid of the preempt_disable() / preempt_enable() pairs by providing a
dedicated variant which makes use of the new percpu code section
infrastructure.

With this the text size of the kernel image is reduced by ~1k
(defconfig). Also 87 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index fa3dc3de5f8b..abd774fdd73a 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -240,6 +240,37 @@ do {									\
 
 #endif /* MARCH_HAS_Z196_FEATURES */
 
+#define arch_this_cpu_read(pcp, op)					\
+({									\
+	unsigned long lc_pcpr, lc_pcpo, res__;				\
+	typedef typeof(pcp) pcp_op_T__;					\
+	pcp_op_T__ *ptr__;						\
+									\
+	lc_pcpr = offsetof(struct lowcore, percpu_register);		\
+	lc_pcpo = offsetof(struct lowcore, percpu_offset);		\
+	ptr__ = PERCPU_PTR(&(pcp));					\
+	asm_inline volatile(						\
+		MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+		AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]")	\
+		op "	%[res__],0(%[ptr__])\n"				\
+		MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]", "%%r0")	\
+		: [res__] "=&d" (res__), [ptr__] "+&a" (ptr__),		\
+		  "=m" (((struct lowcore *)0)->percpu_register)		\
+		: [disppcpr] "i" (lc_pcpr),				\
+		  [disppcpo] "i" (lc_pcpo),				\
+		  [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS),	\
+		  [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS),	\
+		  "m" (*ptr__),						\
+		  "m" (((struct lowcore *)0)->percpu_offset)		\
+		: "cc");						\
+	(pcp_op_T__)res__;						\
+})
+
+#define this_cpu_read_1(pcp) arch_this_cpu_read(pcp, "ic")
+#define this_cpu_read_2(pcp) arch_this_cpu_read(pcp, "lh")
+#define this_cpu_read_4(pcp) arch_this_cpu_read(pcp, "l")
+#define this_cpu_read_8(pcp) arch_this_cpu_read(pcp, "lg")
+
 #define arch_this_cpu_cmpxchg(pcp, oval, nval)				\
 ({									\
 	typedef typeof(pcp) pcp_op_T__;					\
-- 
2.51.0



* [PATCH v3 8/9] s390/percpu: Provide arch_this_cpu_write() implementation
  2026-05-20  9:22 [PATCH v3 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (6 preceding siblings ...)
  2026-05-20  9:22 ` [PATCH v3 7/9] s390/percpu: Provide arch_this_cpu_read() implementation Heiko Carstens
@ 2026-05-20  9:22 ` Heiko Carstens
  2026-05-20  9:22 ` [PATCH v3 9/9] s390/percpu: Remove one and two byte this_cpu operation implementation Heiko Carstens
  2026-05-20 18:42 ` [PATCH v3 0/9] s390: Improve this_cpu operations Yang Shi
  9 siblings, 0 replies; 37+ messages in thread
From: Heiko Carstens @ 2026-05-20  9:22 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, Yang Shi, Shrikanth Hegde, linux-kernel,
	linux-s390

Provide an s390 specific implementation of arch_this_cpu_write()
instead of the generic variant. The generic variant uses a quite
expensive raw_local_irq_save() / raw_local_irq_restore() pair.

Get rid of this by providing a dedicated variant which makes use of the new
percpu code section infrastructure.

With this the text size of the kernel image is reduced by ~1k (defconfig).

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index abd774fdd73a..44867e7ed0df 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -271,6 +271,36 @@ do {									\
 #define this_cpu_read_4(pcp) arch_this_cpu_read(pcp, "l")
 #define this_cpu_read_8(pcp) arch_this_cpu_read(pcp, "lg")
 
+#define arch_this_cpu_write(pcp, val, op)				\
+do {									\
+	unsigned long lc_pcpr, lc_pcpo;					\
+	typedef typeof(pcp) pcp_op_T__;					\
+	pcp_op_T__ *ptr__, val__ = (val);				\
+									\
+	lc_pcpr = offsetof(struct lowcore, percpu_register);		\
+	lc_pcpo = offsetof(struct lowcore, percpu_offset);		\
+	ptr__ = PERCPU_PTR(&(pcp));					\
+	asm_inline volatile(						\
+		MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+		AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]")	\
+		op "    %[val__],0(%[ptr__])\n"				\
+		MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]", "%%r0")	\
+		: [ptr__] "+&a" (ptr__), "=m" (*ptr__),			\
+		  "=m" (((struct lowcore *)0)->percpu_register)		\
+		: [val__] "d" (val__),					\
+		  [disppcpr] "i" (lc_pcpr),				\
+		  [disppcpo] "i" (lc_pcpo),				\
+		  [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS),	\
+		  [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS),	\
+		  "m" (((struct lowcore *)0)->percpu_offset)		\
+		: "cc");						\
+} while (0)
+
+#define this_cpu_write_1(pcp, val) arch_this_cpu_write(pcp, val, "stc")
+#define this_cpu_write_2(pcp, val) arch_this_cpu_write(pcp, val, "sth")
+#define this_cpu_write_4(pcp, val) arch_this_cpu_write(pcp, val, "st")
+#define this_cpu_write_8(pcp, val) arch_this_cpu_write(pcp, val, "stg")
+
 #define arch_this_cpu_cmpxchg(pcp, oval, nval)				\
 ({									\
 	typedef typeof(pcp) pcp_op_T__;					\
-- 
2.51.0



* [PATCH v3 9/9] s390/percpu: Remove one and two byte this_cpu operation implementation
  2026-05-20  9:22 [PATCH v3 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (7 preceding siblings ...)
  2026-05-20  9:22 ` [PATCH v3 8/9] s390/percpu: Provide arch_this_cpu_write() implementation Heiko Carstens
@ 2026-05-20  9:22 ` Heiko Carstens
  2026-05-20 18:42 ` [PATCH v3 0/9] s390: Improve this_cpu operations Yang Shi
  9 siblings, 0 replies; 37+ messages in thread
From: Heiko Carstens @ 2026-05-20  9:22 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, Yang Shi, Shrikanth Hegde, linux-kernel,
	linux-s390

There are no one and two byte this_cpu operations within the kernel
(defconfig). However, even if there were, the s390 implementation, which
uses a cmpxchg loop, generates a very large code sequence due to the lack
of native one and two byte cmpxchg instructions.

Remove the s390 implementation and use the generic implementation.
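
To illustrate why such a sequence gets large, here is a rough sketch of how
a one byte cmpxchg can be emulated with a four byte one. This is not the
kernel's implementation: cmpxchg_u8_emulated() is a hypothetical helper, and
the sketch assumes the GCC/Clang __atomic builtins and that the byte lives
inside a naturally aligned 32-bit object.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: compare-and-exchange a single byte by operating
 * on the aligned 32-bit word that contains it. Returns the byte value
 * that was observed (equal to oldval on success).
 */
static uint8_t cmpxchg_u8_emulated(uint8_t *ptr, uint8_t oldval, uint8_t newval)
{
	uintptr_t addr = (uintptr_t)ptr;
	uint32_t *word = (uint32_t *)(addr & ~(uintptr_t)3);
	unsigned int idx = addr & 3;
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
	unsigned int shift = (3 - idx) * 8;	/* byte 0 is most significant */
#else
	unsigned int shift = idx * 8;		/* byte 0 is least significant */
#endif
	uint32_t mask = (uint32_t)0xff << shift;
	uint32_t cur = __atomic_load_n(word, __ATOMIC_RELAXED);
	uint32_t want;

	assert(shift <= 24);
	for (;;) {
		uint8_t seen = (uint8_t)((cur & mask) >> shift);

		if (seen != oldval)
			return seen;	/* comparison failed, nothing stored */
		want = (cur & ~mask) | ((uint32_t)newval << shift);
		if (__atomic_compare_exchange_n(word, &cur, want, false,
						__ATOMIC_SEQ_CST, __ATOMIC_RELAXED))
			return oldval;
		/* cur now holds the current word content: retry */
	}
}
```

Address masking, shift computation, merging, and the retry loop all have to
be emitted at every call site, which is the overhead a native one byte
instruction would avoid.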

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 44867e7ed0df..72237cb962c2 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -51,15 +51,6 @@
 	new__;								\
 })
 
-#define this_cpu_add_1(pcp, val)	arch_this_cpu_to_op_simple(pcp, val, +)
-#define this_cpu_add_2(pcp, val)	arch_this_cpu_to_op_simple(pcp, val, +)
-#define this_cpu_add_return_1(pcp, val) arch_this_cpu_to_op_simple(pcp, val, +)
-#define this_cpu_add_return_2(pcp, val) arch_this_cpu_to_op_simple(pcp, val, +)
-#define this_cpu_and_1(pcp, val)	arch_this_cpu_to_op_simple(pcp, val, &)
-#define this_cpu_and_2(pcp, val)	arch_this_cpu_to_op_simple(pcp, val, &)
-#define this_cpu_or_1(pcp, val)		arch_this_cpu_to_op_simple(pcp, val, |)
-#define this_cpu_or_2(pcp, val)		arch_this_cpu_to_op_simple(pcp, val, |)
-
 /*
  * Macros to be used for percpu code section based on atomic instructions.
  *
@@ -313,8 +304,6 @@ do {									\
 	ret__;								\
 })
 
-#define this_cpu_cmpxchg_1(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
-#define this_cpu_cmpxchg_2(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 #define this_cpu_cmpxchg_4(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 #define this_cpu_cmpxchg_8(pcp, oval, nval) arch_this_cpu_cmpxchg(pcp, oval, nval)
 
@@ -345,8 +334,6 @@ do {									\
 	ret__;								\
 })
 
-#define this_cpu_xchg_1(pcp, nval) arch_this_cpu_xchg(pcp, nval)
-#define this_cpu_xchg_2(pcp, nval) arch_this_cpu_xchg(pcp, nval)
 #define this_cpu_xchg_4(pcp, nval) arch_this_cpu_xchg(pcp, nval)
 #define this_cpu_xchg_8(pcp, nval) arch_this_cpu_xchg(pcp, nval)
 
-- 
2.51.0



* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-20  9:22 [PATCH v3 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (8 preceding siblings ...)
  2026-05-20  9:22 ` [PATCH v3 9/9] s390/percpu: Remove one and two byte this_cpu operation implementation Heiko Carstens
@ 2026-05-20 18:42 ` Yang Shi
  2026-05-20 22:34   ` David Laight
  9 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2026-05-20 18:42 UTC (permalink / raw)
  To: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere)
  Cc: Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390

Hi Heiko,

Thanks for cc'ing me on the patchset. Please see the inline comments below.


On 5/20/26 2:22 AM, Heiko Carstens wrote:
> v3:
> - Fix various typos [Juergen Christ]
>
> - Add missing kprobe detection / handling [Sashiko [3]]
>    [FWIW, this made me also aware of that the current general s390 kprobes
>     code seems to be racy against concurrent removal of a kprobe while a
>     probe hit on a different CPU. But that is a different story.]
>
> - Fix various minor findings [Sashiko [3]]
>
> - All of this might be dropped / exchanged in future in favor of the percpu
>    page table approach proposed by Yang Shi [4].

Thanks for mentioning my approach. I will do some comparison with rseq 
in the following design details section of the cover letter.

>
> [3] https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca@linux.ibm.com
> [4] https://lore.kernel.org/all/20260429170758.3018959-1-yang@os.amperecomputing.com/
>
> v2:
>
> - Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
>    warnings
>
> - Add missing __packed attribute to insn structure [Sashiko [2]]
>
> - Fix inverted if condition [Sashiko [2]]
>
> - Add missing user_mode() check [Sashiko [2]]
>
> - Move percpu_entry() call in front of irqentry_enter() call in all
>    entry paths to avoid that potential this_cpu() operations overwrite
>    the not-yet saved percpu code section indicator  [Sashiko [2]]
>
> [2] https://sashiko.dev/#/patchset/20260317195436.2276810-1-hca%40linux.ibm.com
>
> v1:
>
> This is a follow-up to Peter Zijlstra's in-kernel rseq RFC [1].
>
> With the intended removal of PREEMPT_NONE this_cpu operations based on
> atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
> become more expensive: the preempt_disable() / preempt_enable() pairs are
> not optimized away anymore during compile time.
>
> In particular the conditional call to preempt_schedule_notrace() after
> preempt_enable() adds additional code and register pressure.
>
> To avoid this Peter suggested an in-kernel rseq approach. While this would
> certainly work, this series tries to come up with a solution which uses
> less instructions and doesn't require to repeat instruction sequences.
>
> The idea is that this_cpu operations based on atomic instructions are
> guarded with mviy instructions:
>
> - The first mviy instruction writes the register number, which contains
>    the percpu address variable to lowcore. This also indicates that a
>    percpu code section is executed.
>
> - The first instruction following the mviy instruction must be the ag
>    instruction which adds the percpu offset to the percpu address register.
>
> - Afterwards the atomic percpu operation follows.
>
> - Then a second mviy instruction writes a zero to lowcore, which indicates
>    the end of the percpu code section.
>
> - In case of an interrupt/exception/nmi the register number which was
>    written to lowcore is copied to the exception frame (pt_regs), and a zero
>    is written to lowcore.
>
> - On return to the previous context it is checked if a percpu code section
>    was executed (saved register number not zero), and if the process was
>    migrated to a different cpu. If the percpu offset was already added to
>    the percpu address register (instruction address does _not_ point to the
>    ag instruction) the content of the percpu address register is adjusted so
>    it points to percpu variable of the new cpu.

If I understand correctly, you replaced preempt_disable() and
preempt_enable() with seq begin and seq end, and seq begin and seq end
can be optimized with the mviy instruction on s390. So you just need a
single mviy instruction for each instead of a read-modify-write of the
seq count.

But you need some extra overhead on context switch (saving and restoring
the seq count register) and need to check whether the task is still on the
same cpu once it resumes execution. And there is also a penalty if it is
migrated to another CPU (need to rerun this_cpu ops).

So it seems to have more overhead than the percpu page table approach IIUC.
We don't need all these steps with the percpu page table. And there is no
penalty for migration.
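
For reference, the return-path check from the quoted cover letter can be
modelled roughly as below. This is only a sketch under stated assumptions:
struct saved_ctx, its field names, and model_percpu_fixup() are hypothetical
and do not mirror the percpu_code_check() / percpu_fixup() code of the
series, and the address of the ag instruction is simply passed in. It
follows the cover letter's description, where the percpu address register
is adjusted on migration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state saved at interrupt entry. */
struct saved_ctx {
	unsigned int percpu_reg;	/* saved register number, 0: no section */
	unsigned long psw_addr;		/* interrupted instruction address */
	unsigned long ag_addr;		/* assumed known address of the ag insn */
	unsigned long gprs[16];		/* saved general purpose registers */
};

/*
 * Returns true if the percpu address register was adjusted.
 * old_off / new_off are the percpu offsets of the cpu the context was
 * interrupted on and of the cpu it resumes on.
 */
static bool model_percpu_fixup(struct saved_ctx *ctx,
			       unsigned long old_off, unsigned long new_off)
{
	assert(ctx->percpu_reg < 16);
	if (!ctx->percpu_reg)
		return false;	/* not within a percpu code section */
	if (old_off == new_off)
		return false;	/* still on the same cpu */
	if (ctx->psw_addr == ctx->ag_addr)
		return false;	/* offset not added yet, ag will use new cpu */
	/* Offset of the old cpu was already added: rebase to the new cpu. */
	ctx->gprs[ctx->percpu_reg] += new_off - old_off;
	return true;
}
```

In this model the cost on migration is a single register adjustment on the
return path.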

>
> All of this seems to work, but of course it could still be broken since I
> missed some detail.
>
> In total this series results in a kernel text size reduction of ~106kb. The
> number of preempt_schedule_notrace() call sites is reduced from 7089 to
> 1577.

Yeah, both approaches can reduce the number of 
preempt_schedule_notrace() call sites. And both approaches can reduce 
the number of non-preemptible critical sections.

>
> Note: this comes without any huge performance analysis, however all
> microbenchmarks confirmed that the new code is at least as fast as the
> old code, like expected.

I'm really interested in the benchmark numbers. I suppose the percpu page
table approach should have better performance, per my analysis above.

Christoph Lameter is also interested in it, so I cc'ed him too.

Thanks,
Yang

>
> [1] 20260223163843.GR1282955@noisy.programming.kicks-ass.net
>
> Heiko Carstens (9):
>    s390/alternatives: Add new ALT_TYPE_PERCPU type
>    s390/percpu: Infrastructure for more efficient this_cpu operations
>    s390/percpu: Add missing do { } while (0) constructs
>    s390/percpu: Use new percpu code section for arch_this_cpu_add()
>    s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
>    s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
>    s390/percpu: Provide arch_this_cpu_read() implementation
>    s390/percpu: Provide arch_this_cpu_write() implementation
>    s390/percpu: Remove one and two byte this_cpu operation implementation
>
>   arch/s390/boot/alternative.c         |   7 +
>   arch/s390/include/asm/alternative.h  |   5 +
>   arch/s390/include/asm/entry-percpu.h |  76 ++++++++
>   arch/s390/include/asm/lowcore.h      |   3 +-
>   arch/s390/include/asm/percpu.h       | 249 +++++++++++++++++++++------
>   arch/s390/include/asm/ptrace.h       |   2 +
>   arch/s390/kernel/alternative.c       |  25 ++-
>   arch/s390/kernel/irq.c               |  26 ++-
>   arch/s390/kernel/nmi.c               |   6 +
>   arch/s390/kernel/traps.c             |   6 +
>   10 files changed, 344 insertions(+), 61 deletions(-)
>   create mode 100644 arch/s390/include/asm/entry-percpu.h
>
> base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8



* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-20 18:42 ` [PATCH v3 0/9] s390: Improve this_cpu operations Yang Shi
@ 2026-05-20 22:34   ` David Laight
  2026-05-21  0:23     ` Yang Shi
  0 siblings, 1 reply; 37+ messages in thread
From: David Laight @ 2026-05-20 22:34 UTC (permalink / raw)
  To: Yang Shi
  Cc: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390

On Wed, 20 May 2026 11:42:36 -0700
Yang Shi <yang@os.amperecomputing.com> wrote:

> Hi Heiko,
> 
> Thanks for cc'ing me the patchset. Please see the below inline comments.
> 
> 
> On 5/20/26 2:22 AM, Heiko Carstens wrote:
> > v3:
> > - Fix various typos [Juergen Christ]
> >
> > - Add missing kprobe detection / handling [Sashiko [3]]
> >    [FWIW, this made me also aware of that the current general s390 kprobes
> >     code seems to be racy against concurrent removal of a kprobe while a
> >     probe hit on a different CPU. But that is a different story.]
> >
> > - Fix various minor findings [Sashiko [3]]
> >
> > - All of this might be dropped / exchanged in future in favor of the percpu
> >    page table approach proposed by Yang Shi [4].  
> 
> Thanks for mentioning my approach. I will do some comparison with rseq 
> in the following design details section of the cover letter.
> 
> >
> > [3] https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca@linux.ibm.com
> > [4] https://lore.kernel.org/all/20260429170758.3018959-1-yang@os.amperecomputing.com/
> >
> > v2:
> >
> > - Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
> >    warnings
> >
> > - Add missing __packed attribute to insn structure [Sashiko [2]]
> >
> > - Fix inverted if condition [Sashiko [2]]
> >
> > - Add missing user_mode() check [Sashiko [2]]
> >
> > - Move percpu_entry() call in front of irqentry_enter() call in all
> >    entry paths to avoid that potential this_cpu() operations overwrite
> >    the not-yet saved percpu code section indicator  [Sashiko [2]]
> >
> > [2] https://sashiko.dev/#/patchset/20260317195436.2276810-1-hca%40linux.ibm.com
> >
> > v1:
> >
> > This is a follow-up to Peter Zijlstra's in-kernel rseq RFC [1].
> >
> > With the intended removal of PREEMPT_NONE this_cpu operations based on
> > atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
> > become more expensive: the preempt_disable() / preempt_enable() pairs are
> > not optimized away anymore during compile time.
> >
> > In particular the conditional call to preempt_schedule_notrace() after
> > preempt_enable() adds additional code and register pressure.
> >
> > To avoid this Peter suggested an in-kernel rseq approach. While this would
> > certainly work, this series tries to come up with a solution which uses
> > less instructions and doesn't require to repeat instruction sequences.
> >
> > The idea is that this_cpu operations based on atomic instructions are
> > guarded with mvyi instructions:
> >
> > - The first mvyi instruction writes the register number, which contains
> >    the percpu address variable to lowcore. This also indicates that a
> >    percpu code section is executed.
> >
> > - The first instruction following the mvyi instruction must be the ag
> >    instruction which adds the percpu offset to the percpu address register.
> >
> > - Afterwards the atomic percpu operation follows.
> >
> > - Then a second mvyi instruction writes a zero to lowcore, which indicates
> >    the end of the percpu code section.
> >
> > - In case of an interrupt/exception/nmi the register number which was
> >    written to lowcore is copied to the exception frame (pt_regs), and a zero
> >    is written to lowcore.
> >
> > - On return to the previous context it is checked if a percpu code section
> >    was executed (saved register number not zero), and if the process was
> >    migrated to a different cpu. If the percpu offset was already added to
> >    the percpu address register (instruction address does _not_ point to the
> >    ag instruction) the content of the percpu address register is adjusted so
> >    it points to percpu variable of the new cpu.  
> 
> If I understand correctly, you replaced preempt_disable() and 
> preempt_enable() with seq begin and seg end, and seq begin and seq end 
> can be optimized by mvyi instruction on S390. So you just need a single 
> mvyi instruction for each instead of read-modify-write the seq count.
> 
> But you need some extra overhead for context switch (save and restore 
> the seq count register) and need to check whether it is still on the 
> same cpu once resuming execution. And there is also penalty if it is 
> migrated to another CPU (need to rerun this_cpu ops).

Not as I understand it.
What happens is the context switch code 'corrupts' the register being
used to access per-cpu data so that it is correct for the new cpu.
The write of zero after the sequence is there to stop the register
being corrupted outside of this code window.

This really just means that you can (mostly) only do single accesses,
since nothing stops preemption between the read and the write, or in
the middle of an RMW sequence.
Although you can probably do an increment of the preempt disable count
because if you are preempted the value read will be zero.
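
A rough user-space model of that register fixup, as a sketch only. It
follows the cover letter's description (adjust the register only if a
percpu section was active, the task migrated, and the interrupted
instruction address does not point at the ag). The names saved_ctx,
percpu_fixup and ag_ip and the offset values are made up for
illustration; they are not taken from the series.

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS 4
#define NR_REGS 16

/* Hypothetical per-cpu offsets, made up for this model. */
static const uint64_t percpu_offset[NR_CPUS] = {
	0x10000, 0x20000, 0x30000, 0x40000
};

/* Hypothetical stand-in for the state saved in the exception frame. */
struct saved_ctx {
	uint64_t gprs[NR_REGS];		/* saved general purpose registers */
	uint64_t ip;			/* interrupted instruction address */
	unsigned int percpu_reg;	/* register number from lowcore, 0 = none */
};

/*
 * Model of the return path. ag_ip is assumed to be the address of the
 * ag instruction of the interrupted percpu code section.
 */
static void percpu_fixup(struct saved_ctx *ctx, uint64_t ag_ip,
			 int old_cpu, int new_cpu)
{
	assert(old_cpu >= 0 && old_cpu < NR_CPUS);
	assert(new_cpu >= 0 && new_cpu < NR_CPUS);
	assert(ctx->percpu_reg < NR_REGS);

	if (!ctx->percpu_reg)		/* no percpu code section was active */
		return;
	if (old_cpu == new_cpu)		/* no migration, register still valid */
		return;
	if (ctx->ip == ag_ip)		/* offset not added yet, ag will do it */
		return;
	/* Rebase the address from the old cpu's area to the new cpu's area. */
	ctx->gprs[ctx->percpu_reg] +=
		percpu_offset[new_cpu] - percpu_offset[old_cpu];
}
```

With %r4 holding percpu_offset[0] + 0x100 and the task resuming on
cpu 2 at the laag, the model leaves %r4 at percpu_offset[2] + 0x100.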

> 
> So it seems have more overhead than the percpu page table approach IIUC. 
> We don't need all the steps with percpu page table. And there is no 
> penalty for migration.

This code looks like it relies on 'page zero' already being percpu.
So it probably isn't really that different.
Some values like the 'preemption disable count' and 'current' could be
(maybe are?) written into page zero to give fast access.

But I'm sure I remember that some cpu don't like having the same
physical address at different virtual addresses (and not just those
with VIVT caches like some sparc cpu).
I'm sure code can end up accessing the current cpu's percpu data
using the same address that other cpu use - there are definitely
places where it needs that address.
On x86-64 that means reading the address from the array rather
than just offsetting from %gs.

-- David

> 
> >
> > All of this seems to work, but of course it could still be broken since I
> > missed some detail.
> >
> > In total this series results in a kernel text size reduction of ~106kb. The
> > number of preempt_schedule_notrace() call sites is reduced from 7089 to
> > 1577.  
> 
> Yeah, both approaches can reduce the number of 
> preempt_schedule_notrace() call sites. And both approaches can reduce 
> the number of non-preemptible critical sections.
> 
> >
> > Note: this comes without any huge performance analysis, however all
> > microbenchmarks confirmed that the new code is at least as fast as the
> > old code, like expected.  
> 
> I'm really interested in the benchmark number. I'm supposed percpu page 
> table approach should have better performance per my above analysis.
> 
> Christopher Lameter is also interested in it, cc'ed him too.
> 
> Thanks,
> Yang
> 
> >
> > [1] 20260223163843.GR1282955@noisy.programming.kicks-ass.net
> >
> > Heiko Carstens (9):
> >    s390/alternatives: Add new ALT_TYPE_PERCPU type
> >    s390/percpu: Infrastructure for more efficient this_cpu operations
> >    s390/percpu: Add missing do { } while (0) constructs
> >    s390/percpu: Use new percpu code section for arch_this_cpu_add()
> >    s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
> >    s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
> >    s390/percpu: Provide arch_this_cpu_read() implementation
> >    s390/percpu: Provide arch_this_cpu_write() implementation
> >    s390/percpu: Remove one and two byte this_cpu operation implementation
> >
> >   arch/s390/boot/alternative.c         |   7 +
> >   arch/s390/include/asm/alternative.h  |   5 +
> >   arch/s390/include/asm/entry-percpu.h |  76 ++++++++
> >   arch/s390/include/asm/lowcore.h      |   3 +-
> >   arch/s390/include/asm/percpu.h       | 249 +++++++++++++++++++++------
> >   arch/s390/include/asm/ptrace.h       |   2 +
> >   arch/s390/kernel/alternative.c       |  25 ++-
> >   arch/s390/kernel/irq.c               |  26 ++-
> >   arch/s390/kernel/nmi.c               |   6 +
> >   arch/s390/kernel/traps.c             |   6 +
> >   10 files changed, 344 insertions(+), 61 deletions(-)
> >   create mode 100644 arch/s390/include/asm/entry-percpu.h
> >
> > base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8  
> 
> 



* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-20 22:34   ` David Laight
@ 2026-05-21  0:23     ` Yang Shi
  2026-05-21 10:17       ` David Laight
                         ` (2 more replies)
  0 siblings, 3 replies; 37+ messages in thread
From: Yang Shi @ 2026-05-21  0:23 UTC (permalink / raw)
  To: David Laight
  Cc: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390



On 5/20/26 3:34 PM, David Laight wrote:
> On Wed, 20 May 2026 11:42:36 -0700
> Yang Shi <yang@os.amperecomputing.com> wrote:
>
>> Hi Heiko,
>>
>> Thanks for cc'ing me the patchset. Please see the below inline comments.
>>
>>
>> On 5/20/26 2:22 AM, Heiko Carstens wrote:
>>> v3:
>>> - Fix various typos [Juergen Christ]
>>>
>>> - Add missing kprobe detection / handling [Sashiko [3]]
>>>     [FWIW, this made me also aware of that the current general s390 kprobes
>>>      code seems to be racy against concurrent removal of a kprobe while a
>>>      probe hit on a different CPU. But that is a different story.]
>>>
>>> - Fix various minor findings [Sashiko [3]]
>>>
>>> - All of this might be dropped / exchanged in future in favor of the percpu
>>>     page table approach proposed by Yang Shi [4].
>> Thanks for mentioning my approach. I will do some comparison with rseq
>> in the following design details section of the cover letter.
>>
>>> [3] https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca@linux.ibm.com
>>> [4] https://lore.kernel.org/all/20260429170758.3018959-1-yang@os.amperecomputing.com/
>>>
>>> v2:
>>>
>>> - Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
>>>     warnings
>>>
>>> - Add missing __packed attribute to insn structure [Sashiko [2]]
>>>
>>> - Fix inverted if condition [Sashiko [2]]
>>>
>>> - Add missing user_mode() check [Sashiko [2]]
>>>
>>> - Move percpu_entry() call in front of irqentry_enter() call in all
>>>     entry paths to avoid that potential this_cpu() operations overwrite
>>>     the not-yet saved percpu code section indicator  [Sashiko [2]]
>>>
>>> [2] https://sashiko.dev/#/patchset/20260317195436.2276810-1-hca%40linux.ibm.com
>>>
>>> v1:
>>>
>>> This is a follow-up to Peter Zijlstra's in-kernel rseq RFC [1].
>>>
>>> With the intended removal of PREEMPT_NONE this_cpu operations based on
>>> atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
>>> become more expensive: the preempt_disable() / preempt_enable() pairs are
>>> not optimized away anymore during compile time.
>>>
>>> In particular the conditional call to preempt_schedule_notrace() after
>>> preempt_enable() adds additional code and register pressure.
>>>
>>> To avoid this Peter suggested an in-kernel rseq approach. While this would
>>> certainly work, this series tries to come up with a solution which uses
>>> less instructions and doesn't require to repeat instruction sequences.
>>>
>>> The idea is that this_cpu operations based on atomic instructions are
>>> guarded with mvyi instructions:
>>>
>>> - The first mvyi instruction writes the register number, which contains
>>>     the percpu address variable to lowcore. This also indicates that a
>>>     percpu code section is executed.
>>>
>>> - The first instruction following the mvyi instruction must be the ag
>>>     instruction which adds the percpu offset to the percpu address register.
>>>
>>> - Afterwards the atomic percpu operation follows.
>>>
>>> - Then a second mvyi instruction writes a zero to lowcore, which indicates
>>>     the end of the percpu code section.
>>>
>>> - In case of an interrupt/exception/nmi the register number which was
>>>     written to lowcore is copied to the exception frame (pt_regs), and a zero
>>>     is written to lowcore.
>>>
>>> - On return to the previous context it is checked if a percpu code section
>>>     was executed (saved register number not zero), and if the process was
>>>     migrated to a different cpu. If the percpu offset was already added to
>>>     the percpu address register (instruction address does _not_ point to the
>>>     ag instruction) the content of the percpu address register is adjusted so
>>>     it points to percpu variable of the new cpu.
>> If I understand correctly, you replaced preempt_disable() and
>> preempt_enable() with seq begin and seg end, and seq begin and seq end
>> can be optimized by mvyi instruction on S390. So you just need a single
>> mvyi instruction for each instead of read-modify-write the seq count.
>>
>> But you need some extra overhead for context switch (save and restore
>> the seq count register) and need to check whether it is still on the
>> same cpu once resuming execution. And there is also penalty if it is
>> migrated to another CPU (need to rerun this_cpu ops).
> Not as I understand it.
> What happens is the context switch code 'corrupts' the register being
> used to access per-cpu data so that it is correct for the new cpu.
> The write of zero after the sequence is there to stop the register
> being corrupted outside of this code window.

Thanks for elaborating. I misunderstood some nuances. I have read the
patch #2 commit message, and now I think I understand how it works.

Borrowing the disassembly from the patch #2 commit message:

   11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
   11a8ec:       b9 04 00 43             lgr     %r4,%r3
   11a8f0:       eb 00 43 c0 00 52       mviy    960,4
   11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
   11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
   11a902:       eb 00 03 c0 00 52       mviy    960,0
   11a908:       b9 08 00 25             agr     %r2,%r5
   11a90c:       07 fe                   br      %r14

11a8f0 loads the percpu offset and marks the beginning of the percpu code
section. I believe this is needed with the percpu page table too, because
we need to load the local percpu offset.
11a902 loads 0 to the register to mark the end of the percpu code section;
this is not needed with the percpu page table.

And you need to save the register at irq/exception entry, then restore it
at exit. But you also need to check whether migration happened. If it
did, the kernel needs to rewrite the register with the correct percpu
offset and needs to check whether the interrupted instruction is "ag".
If it is the "ag" instruction (11a8f6), the kernel needs to recalculate
the percpu address, right?

It sounds a little bit hacky to me TBH, and it incurs some extra overhead
for "migration detection" and fixup.

>
> This really just means that you can (mostly) only do single accesses,
> since nothing stops pre-emption between the RW or an RMW sequence.
> Although you can probably do an increment of the preempt disable count
> because if you are preempted the value read will be zero.
>
>> So it seems have more overhead than the percpu page table approach IIUC.
>> We don't need all the steps with percpu page table. And there is no
>> penalty for migration.
> This code looks like it relies on 'page zero' already being percpu.
> So it probably isn't really that different.
> Some values like the 'preemption disable count' and 'current' could be
> (maybe are?) written into page zero to give fast access.

I don't quite get what you mean about 'page zero'.

>
> But I'm sure I remember that some cpu don't like having the same
> physical address at different virtual addresses (and not just those
> with VIVT caches like some sparc cpu).

Yeah, VIVT caches don't like it due to cache aliasing. But the mapping is
really percpu, so the mapping to the physical address belonging to
another CPU should never pollute the current CPU's cache, unless I'm
missing something.

> I'm sure code can end up accessing the current cpu's percpu data
> using the same address that other cpu use - there are definitely
> places where it needs that address.

No, it is not. In the percpu page table approach, we use different
addresses for this_cpu_*() and per_cpu_ptr(); the latter is mainly used
to initialize percpu data for all CPUs.

Thanks,
Yang

> On x86-64 that means it reading the address from the array rather
> than just offsetting from %gs.
>
> -- David
>
>>> All of this seems to work, but of course it could still be broken since I
>>> missed some detail.
>>>
>>> In total this series results in a kernel text size reduction of ~106kb. The
>>> number of preempt_schedule_notrace() call sites is reduced from 7089 to
>>> 1577.
>> Yeah, both approaches can reduce the number of
>> preempt_schedule_notrace() call sites. And both approaches can reduce
>> the number of non-preemptible critical sections.
>>
>>> Note: this comes without any huge performance analysis, however all
>>> microbenchmarks confirmed that the new code is at least as fast as the
>>> old code, like expected.
>> I'm really interested in the benchmark number. I'm supposed percpu page
>> table approach should have better performance per my above analysis.
>>
>> Christopher Lameter is also interested in it, cc'ed him too.
>>
>> Thanks,
>> Yang
>>
>>> [1] 20260223163843.GR1282955@noisy.programming.kicks-ass.net
>>>
>>> Heiko Carstens (9):
>>>     s390/alternatives: Add new ALT_TYPE_PERCPU type
>>>     s390/percpu: Infrastructure for more efficient this_cpu operations
>>>     s390/percpu: Add missing do { } while (0) constructs
>>>     s390/percpu: Use new percpu code section for arch_this_cpu_add()
>>>     s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
>>>     s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
>>>     s390/percpu: Provide arch_this_cpu_read() implementation
>>>     s390/percpu: Provide arch_this_cpu_write() implementation
>>>     s390/percpu: Remove one and two byte this_cpu operation implementation
>>>
>>>    arch/s390/boot/alternative.c         |   7 +
>>>    arch/s390/include/asm/alternative.h  |   5 +
>>>    arch/s390/include/asm/entry-percpu.h |  76 ++++++++
>>>    arch/s390/include/asm/lowcore.h      |   3 +-
>>>    arch/s390/include/asm/percpu.h       | 249 +++++++++++++++++++++------
>>>    arch/s390/include/asm/ptrace.h       |   2 +
>>>    arch/s390/kernel/alternative.c       |  25 ++-
>>>    arch/s390/kernel/irq.c               |  26 ++-
>>>    arch/s390/kernel/nmi.c               |   6 +
>>>    arch/s390/kernel/traps.c             |   6 +
>>>    10 files changed, 344 insertions(+), 61 deletions(-)
>>>    create mode 100644 arch/s390/include/asm/entry-percpu.h
>>>
>>> base-commit: 5200f5f493f79f14bbdc349e402a40dfb32f23c8
>>



* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-21  0:23     ` Yang Shi
@ 2026-05-21 10:17       ` David Laight
  2026-05-21 16:57         ` Yang Shi
  2026-05-21 10:23       ` David Laight
  2026-05-21 10:37       ` Heiko Carstens
  2 siblings, 1 reply; 37+ messages in thread
From: David Laight @ 2026-05-21 10:17 UTC (permalink / raw)
  To: Yang Shi
  Cc: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390

On Wed, 20 May 2026 17:23:37 -0700
Yang Shi <yang@os.amperecomputing.com> wrote:

> On 5/20/26 3:34 PM, David Laight wrote:
...
> >
> > But I'm sure I remember that some cpu don't like having the same
> > physical address at different virtual addresses (and not just those
> > with VIVT caches like some sparc cpu).  
> 
> Yeah, VIVT cache doesn't like it due to cache alias. But the mapping is 
> really percpu, so the mapping to the physical address belonging to 
> another CPU should never pollute the current CPU's cache if I don't miss 
> something.
>
> > I'm sure code can end up accessing the current cpu's percpu data
> > using the same address that other cpu use - there are definitely
> > places where it needs that address.  
> 
> No, it is not. In the percpu page table approach, we use different 
> address for this_cpu_*() and per_cpu_ptr() which is mainly used to 
> initialize percpu data for all CPUs.

You missed something.

Look, for example, at kernel/locking/osq_lock.c
The code uses this_cpu_ptr() and then both dereferences the pointer
and writes it to places that other cpu will use.
It also uses per_cpu_ptr() to get an address it can use for the per-cpu
data of another cpu.
(That code all assumes preemption is disabled.)

-- David

> Thanks,
> Yang


* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-21 10:17       ` David Laight
@ 2026-05-21 16:57         ` Yang Shi
  2026-05-21 17:55           ` David Laight
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2026-05-21 16:57 UTC (permalink / raw)
  To: David Laight
  Cc: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390



On 5/21/26 3:17 AM, David Laight wrote:
> On Wed, 20 May 2026 17:23:37 -0700
> Yang Shi <yang@os.amperecomputing.com> wrote:
>
>> On 5/20/26 3:34 PM, David Laight wrote:
> ...
>>> But I'm sure I remember that some cpu don't like having the same
>>> physical address at different virtual addresses (and not just those
>>> with VIVT caches like some sparc cpu).
>> Yeah, VIVT cache doesn't like it due to cache alias. But the mapping is
>> really percpu, so the mapping to the physical address belonging to
>> another CPU should never pollute the current CPU's cache if I don't miss
>> something.
>>
>>> I'm sure code can end up accessing the current cpu's percpu data
>>> using the same address that other cpu use - there are definitely
>>> places where it needs that address.
>> No, it is not. In the percpu page table approach, we use different
>> address for this_cpu_*() and per_cpu_ptr() which is mainly used to
>> initialize percpu data for all CPUs.
> You missed something.
>
> Look, for example, at kernel/locking/osq_lock.c
> The code uses this_cpu_ptr() and then both dereferences the pointer
> and writes it to places that other cpu will use.
> It also uses per_cpu_ptr() to get an address it can use for the per-cpu
> data of another cpu.
> (That code all assumes preemption is disabled.)

this_cpu_ptr() uses different addresses for different CPUs. It is a
special case, not because of VIVT, but because otherwise it may confuse
the list API: the list API determines whether a list is empty by
comparing pointers (head->next == head). this_cpu_read/write/add/sub,
etc., are fine.
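
To illustrate the list API point, here is a small user-space model, as
a sketch only. It simulates virtual addresses with integers and assumes
a list head that was initialised through its global address and is
later tested through a per-cpu alias. LOCAL_BASE, GLOBAL_BASE,
CPU_STRIDE and the sim_* names are hypothetical and exist only for this
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS     2
#define LOCAL_BASE  0x1000u	/* hypothetical alias, same value on every cpu */
#define GLOBAL_BASE 0x8000u	/* hypothetical start of the global area */
#define CPU_STRIDE  0x1000u	/* hypothetical distance between cpu areas */

/* A list head whose links are stored as simulated virtual addresses. */
struct sim_list_head {
	uintptr_t next, prev;
};

/* Backing storage: one list head per cpu. */
static struct sim_list_head heads[NR_CPUS];

/* Address of a cpu's list head as seen through the global mapping. */
static uintptr_t global_addr(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return GLOBAL_BASE + (uintptr_t)cpu * CPU_STRIDE;
}

/* Address of the current cpu's list head as seen through the alias. */
static uintptr_t local_addr(void)
{
	return LOCAL_BASE;
}

/* Initialise the head through the global address. */
static void init_head_global(int cpu)
{
	heads[cpu].next = heads[cpu].prev = global_addr(cpu);
}

/* list_empty() semantics: head->next == head. */
static bool sim_list_empty(int cpu, uintptr_t head_addr)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return heads[cpu].next == head_addr;
}
```

After init_head_global(1), sim_list_empty(1, global_addr(1)) is true,
while sim_list_empty(1, local_addr()) is false although both addresses
name the same object in this model.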

And per_cpu_ptr() also uses different addresses for different CPUs.

The LWN article explains it:
https://lwn.net/SubscriberLink/1073395/12c08f128e515809/

Thanks,
Yang

>
> -- David
>
>> Thanks,
>> Yang



* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-21 16:57         ` Yang Shi
@ 2026-05-21 17:55           ` David Laight
  2026-05-21 20:46             ` Yang Shi
  0 siblings, 1 reply; 37+ messages in thread
From: David Laight @ 2026-05-21 17:55 UTC (permalink / raw)
  To: Yang Shi
  Cc: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390

On Thu, 21 May 2026 09:57:37 -0700
Yang Shi <yang@os.amperecomputing.com> wrote:

> On 5/21/26 3:17 AM, David Laight wrote:
> > On Wed, 20 May 2026 17:23:37 -0700
> > Yang Shi <yang@os.amperecomputing.com> wrote:
> >  
> >> On 5/20/26 3:34 PM, David Laight wrote:  
> > ...  
> >>> But I'm sure I remember that some cpu don't like having the same
> >>> physical address at different virtual addresses (and not just those
> >>> with VIVT caches like some sparc cpu).  
> >> Yeah, VIVT cache doesn't like it due to cache alias. But the mapping is
> >> really percpu, so the mapping to the physical address belonging to
> >> another CPU should never pollute the current CPU's cache if I don't miss
> >> something.
> >>  
> >>> I'm sure code can end up accessing the current cpu's percpu data
> >>> using the same address that other cpu use - there are definitely
> >>> places where it needs that address.  
> >> No, it is not. In the percpu page table approach, we use different
> >> address for this_cpu_*() and per_cpu_ptr() which is mainly used to
> >> initialize percpu data for all CPUs.  
> > You missed something.
> >
> > Look, for example, at kernel/locking/osq_lock.c
> > The code uses this_cpu_ptr() and then both dereferences the pointer
> > and writes it to places that other cpu will use.
> > It also uses per_cpu_ptr() to get an address it can use for the per-cpu
> > data of another cpu.
> > (That code all assumes preemption is disabled.)  
> 
> this_cpu_ptr() uses different addresses for different CPUs. It is a 
> special case, it is not due to VIVT, but because it may confuse list 
> API. Because list API determines list is empty by comparing pointers 
> (head->next == head). this_cpu_read/write/add/sub, etc, are fine.

But you could quite easily get code that manages to mix accesses through
this_cpu_ptr() with direct accesses to per-cpu variables in the same
cache line.
I'm sure some arm cpu really don't like you doing that.
(But it is a foggy memory from somewhere.)

You can use per-cpu page tables, but it really only helps for a
few items.
Anything that is RMW (eg add on pretty much everything except x86)
either has to disable preemption or use a compare and exchange loop.
Variables like 'current' can be written into the per-cpu page table
data area by the process switch code (as I believe s390 does).
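
For reference, a compare and exchange loop for an add could look like
the sketch below. It uses plain C11 atomics in user space, not any
kernel helper, and add_return_cas is a made-up name. The loop makes the
update atomic against whatever ran in between; it does not by itself
keep the task on one cpu.

```c
#include <assert.h>
#include <stdatomic.h>

/* Add val to *var with a compare and exchange loop, return the new value. */
static long add_return_cas(_Atomic long *var, long val)
{
	long old;

	assert(var);
	old = atomic_load_explicit(var, memory_order_relaxed);
	/* On failure 'old' is refreshed with the current value, then retry. */
	while (!atomic_compare_exchange_weak_explicit(var, &old, old + val,
						      memory_order_relaxed,
						      memory_order_relaxed))
		;
	return old + val;
}
```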

The 'trick' here will work for reading/writing values if you don't
care that the value read is stale (or might have been written to
the memory for a different cpu).
It might work for updating the preemption disable count - because
you can only be preempted while it is zero.

-- David

> 
> And per_cpu_ptr() also uses different addresses for different CPUs.
> 
> The lwn article explained it. 
> https://lwn.net/SubscriberLink/1073395/12c08f128e515809/
> 
> Thanks,
> Yang
> 
> >
> > -- David
> >  
> >> Thanks,
> >> Yang  
> 



* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-21 17:55           ` David Laight
@ 2026-05-21 20:46             ` Yang Shi
  2026-05-21 22:13               ` David Laight
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2026-05-21 20:46 UTC (permalink / raw)
  To: David Laight
  Cc: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390



On 5/21/26 10:55 AM, David Laight wrote:
> On Thu, 21 May 2026 09:57:37 -0700
> Yang Shi <yang@os.amperecomputing.com> wrote:
>
>> On 5/21/26 3:17 AM, David Laight wrote:
>>> On Wed, 20 May 2026 17:23:37 -0700
>>> Yang Shi <yang@os.amperecomputing.com> wrote:
>>>   
>>>> On 5/20/26 3:34 PM, David Laight wrote:
>>> ...
>>>>> But I'm sure I remember that some cpu don't like having the same
>>>>> physical address at different virtual addresses (and not just those
>>>>> with VIVT caches like some sparc cpu).
>>>> Yeah, VIVT cache doesn't like it due to cache alias. But the mapping is
>>>> really percpu, so the mapping to the physical address belonging to
>>>> another CPU should never pollute the current CPU's cache if I don't miss
>>>> something.
>>>>   
>>>>> I'm sure code can end up accessing the current cpu's percpu data
>>>>> using the same address that other cpu use - there are definitely
>>>>> places where it needs that address.
>>>> No, it is not. In the percpu page table approach, we use different
>>>> address for this_cpu_*() and per_cpu_ptr() which is mainly used to
>>>> initialize percpu data for all CPUs.
>>> You missed something.
>>>
>>> Look, for example, at kernel/locking/osq_lock.c
>>> The code uses this_cpu_ptr() and then both dereferences the pointer
>>> and writes it to places that other cpu will use.
>>> It also uses per_cpu_ptr() to get an address it can use for the per-cpu
>>> data of another cpu.
>>> (That code all assumes preemption is disabled.)
>> this_cpu_ptr() uses different addresses for different CPUs. It is a
>> special case, it is not due to VIVT, but because it may confuse list
>> API. Because list API determines list is empty by comparing pointers
>> (head->next == head). this_cpu_read/write/add/sub, etc, are fine.
> But you could quite easily get code that manages to mix accesses through
> this_cpu_ptr() with direct accesses to per-cpu variables in the same
> cache line.

I can see a potential cache aliasing issue with VIVT caches with the
pattern below:

for_each_cpu(cpu)
     per_cpu_ptr(cpu) <-- Initialize per cpu data

this_cpu_inc(current_cpu) <-- Inc the current cpu copy

this_cpu_inc() may see a stale (uninitialized) copy if there is no cache
flush after initialization.

> I'm sure some arm cpu really don't like you doing that.
> (But it is a foggy memory from somewhere.)

ARMv8 requires PIPT if I remember correctly. Some old 32-bit ARM
machines may have VIVT caches, and I can see some potential issues
there, but 32-bit ARM is not the target user TBH, and VIVT caches are
rare now.

>
> You can use per-cpu page tables, but it really only helps for a
> few items.
> Anything that is RMW (eg add on pretty much everything except x86)
> either has to disable preemption or use a compare and exchange loop.

It only helps this_cpu ops because they use atomic instructions (at
least on ARM64). __this_cpu ops still require preemption to be disabled.
But the performance improvement is still impressive even though it only
helps this_cpu ops.

> Variables like 'current' can be written into the per-cpu page table
> data area by the process switch code (as I believe s390 does).

That may be a useful use case.

>
> The 'trick' here will work for reading/writing values if you don't
> care that the value read is stale (or might have been written to
> the memory for a different cpu).

If you don't care about the stale value, you can just call __this_cpu_read().

Thanks,
Yang

> It might work for updating the preemption disable count - because
> you can only be preempted while it is zero.
>
> -- David
>
>> And per_cpu_ptr() also uses different addresses for different CPUs.
>>
>> The lwn article explained it.
>> https://lwn.net/SubscriberLink/1073395/12c08f128e515809/
>>
>> Thanks,
>> Yang
>>
>>> -- David
>>>   
>>>> Thanks,
>>>> Yang



* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-21 20:46             ` Yang Shi
@ 2026-05-21 22:13               ` David Laight
  2026-05-21 23:41                 ` Yang Shi
  0 siblings, 1 reply; 37+ messages in thread
From: David Laight @ 2026-05-21 22:13 UTC (permalink / raw)
  To: Yang Shi
  Cc: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390

On Thu, 21 May 2026 13:46:25 -0700
Yang Shi <yang@os.amperecomputing.com> wrote:

> On 5/21/26 10:55 AM, David Laight wrote:
...
> > The 'trick' here will work for reading/writing values if you don't
> > care that the value read is stale (or might have been written to
> > the memory for a different cpu).  
> 
> If you don't care the stale value, you can just call __this_cpu_read().

You can get an impossible value.
The generated code might be like this:
	this_cpu_data = xxx;
	preempt_disable_count = this_cpu_data->preempt_disable_count;
If the count was non-zero at the start you'll read the value from
the current cpu and all is fine.
But if the count is zero you can get preempted between the instructions,
the process now running on your 'old' cpu can increment the value
and you then read the new non-zero value.
That won't be good at all.
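
The interleaving can be replayed deterministically in user space. This
is a sketch only: cpu_data, load_this_cpu_data and replay_race are
invented names, and preemption and migration are modelled by simply
ordering the steps by hand.

```c
#include <assert.h>

#define NR_CPUS 2

/* Hypothetical per-cpu data block holding the preempt disable count. */
struct cpu_data {
	int preempt_disable_count;
};

static struct cpu_data cpu_data[NR_CPUS];

/* First instruction of the sequence: compute this cpu's data pointer. */
static struct cpu_data *load_this_cpu_data(int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);
	return &cpu_data[cpu];
}

/* Replay the race and return the value that task A ends up reading. */
static int replay_race(void)
{
	struct cpu_data *p;

	cpu_data[0].preempt_disable_count = 0;
	cpu_data[1].preempt_disable_count = 0;

	/* Task A on cpu 0, count is zero so A can be preempted. */
	p = load_this_cpu_data(0);

	/* A is preempted here and will resume on cpu 1. */

	/* Task B, now running on cpu 0, disables preemption. */
	cpu_data[0].preempt_disable_count++;

	/* Second instruction: A, now on cpu 1, reads through the old pointer. */
	return p->preempt_disable_count;
}
```

In this model replay_race() returns 1, while cpu 1, where task A
actually runs by then, still has a count of 0.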

You can only use __this_cpu_read() for things that don't change.

The big problem with using per-cpu page tables is there will be absolutely
nothing stopping code taking the wrong address of a per-cpu variable and
saving it somewhere.
At the moment you have to use the helper so always get the global address.

-- David
	
> 
> Thanks,
> Yang


* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-21 22:13               ` David Laight
@ 2026-05-21 23:41                 ` Yang Shi
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2026-05-21 23:41 UTC (permalink / raw)
  To: David Laight
  Cc: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390



On 5/21/26 3:13 PM, David Laight wrote:
> On Thu, 21 May 2026 13:46:25 -0700
> Yang Shi <yang@os.amperecomputing.com> wrote:
>
>> On 5/21/26 10:55 AM, David Laight wrote:
> ...
>>> The 'trick' here will work for reading/writing values if you don't
>>> care that the value read is stale (or might have been written to
>>> the memory for a different cpu).
>> If you don't care the stale value, you can just call __this_cpu_read().
> You can get an impossible value.
> The generated code might be like this:
> 	this_cpu_data = xxx;
> 	preempt_disable_count = this_cpu_data->preempt_disable_count;
> If the count was non-zero at the start you'll read the value from
> the current cpu and all is fine.
> But if the count is zero you can get preempted between the instructions,
> the process now running on your 'old' cpu can increment the value
> and you then read the new non-zero value.
> That won't be good at all.

TBH, I don't think this counts as "don't care about the stale value".

>
> You can only use __this_cpu_read() for things that don't change.
>
> The big problem with using per-cpu page tables is there will be absolutely
> nothing stopping code taking the wrong address of a per-cpu variable and
> saving it somewhere.

Err... I'm lost. If you mean RW or RMW, atomic instructions are required
for this_cpu ops. This is how this_cpu ops are implemented on ARM64, even
without a percpu page table. If the operation is interrupted in the
middle, the exclusive monitor is cleared, so the store fails and the
value is reloaded.

If you mean this_cpu_read() of some value followed by this_cpu_write() to
the same cpu, I don't think it can work as expected without disabling
preemption for the whole code section, even without a percpu page table.

Thanks,
Yang

> At the moment you have to use the helper so always get the global address.
>
> -- David
> 	
>> Thanks,
>> Yang


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-21  0:23     ` Yang Shi
  2026-05-21 10:17       ` David Laight
@ 2026-05-21 10:23       ` David Laight
  2026-05-21 17:48         ` Yang Shi
  2026-05-21 10:37       ` Heiko Carstens
  2 siblings, 1 reply; 37+ messages in thread
From: David Laight @ 2026-05-21 10:23 UTC (permalink / raw)
  To: Yang Shi
  Cc: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390

On Wed, 20 May 2026 17:23:37 -0700
Yang Shi <yang@os.amperecomputing.com> wrote:

> On 5/20/26 3:34 PM, David Laight wrote:
..
> >> So it seems have more overhead than the percpu page table approach IIUC.
> >> We don't need all the steps with percpu page table. And there is no
> >> penalty for migration.  
> > This code looks like it relies on 'page zero' already being percpu.
> > So it probably isn't really that different.
> > Some values like the 'preemption disable count' and 'current' could be
> > (maybe are?) written into page zero to give fast access.  
> 
> I don't quite get what you mean about 'page zero'.

'page zero' is (at least for some CPUs) the memory that can be accessed
using a small offset embedded in the instruction.
This is equivalent to using offsets from an 'always zero' %r0.

The code relies on the accesses to 960(%r0) being per-cpu.

-- David

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-21 10:23       ` David Laight
@ 2026-05-21 17:48         ` Yang Shi
  0 siblings, 0 replies; 37+ messages in thread
From: Yang Shi @ 2026-05-21 17:48 UTC (permalink / raw)
  To: David Laight
  Cc: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390



On 5/21/26 3:23 AM, David Laight wrote:
> On Wed, 20 May 2026 17:23:37 -0700
> Yang Shi <yang@os.amperecomputing.com> wrote:
>
>> On 5/20/26 3:34 PM, David Laight wrote:
> ..
>>>> So it seems have more overhead than the percpu page table approach IIUC.
>>>> We don't need all the steps with percpu page table. And there is no
>>>> penalty for migration.
>>> This code looks like it relies on 'page zero' already being percpu.
>>> So it probably isn't really that different.
>>> Some values like the 'preemption disable count' and 'current' could be
>>> (maybe are?) written into page zero to give fast access.
>> I don't quite get what you mean about 'page zero'.
> 'page zero' is (at least for some cpu) the memory that can be accessed
> using a small offset embedded in the instruction.
> This is equivalent to using offsets from an 'always zero' %r0.
>
> The code relies on the accesses to 960(%r0) being per-cpu.

Thank you. Heiko also elaborated it.

Yang

>
> -- David


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-21  0:23     ` Yang Shi
  2026-05-21 10:17       ` David Laight
  2026-05-21 10:23       ` David Laight
@ 2026-05-21 10:37       ` Heiko Carstens
  2026-05-21 17:47         ` Yang Shi
  2 siblings, 1 reply; 37+ messages in thread
From: Heiko Carstens @ 2026-05-21 10:37 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Laight, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390

On Wed, May 20, 2026 at 05:23:37PM -0700, Yang Shi wrote:
> > > If I understand correctly, you replaced preempt_disable() and
> > > > preempt_enable() with seq begin and seq end, and seq begin and seq end
> > > > can be optimized by mviy instruction on S390. So you just need a single
> > > > mviy instruction for each instead of read-modify-write the seq count.
> > > 
> > > But you need some extra overhead for context switch (save and restore
> > > the seq count register) and need to check whether it is still on the
> > > same cpu once resuming execution. And there is also penalty if it is
> > > migrated to another CPU (need to rerun this_cpu ops).
> > Not as I understand it.
> > What happens is the context switch code 'corrupts' the register being
> > used to access per-cpu data so that it is correct for the new cpu.
> > The write of zero after the sequence is there to stop the register
> > being corrupted outside of this code window.
> 
> Thanks for elaborating it. I misunderstood some nuance. I read the patch #2
> commit message, now I think I understand how it works.

As background: s390 has so-called prefix pages; the first two pages of every
CPU are percpu, via a special prefixing mechanism. Parts of these pages can be
used by operating systems as a percpu data area, which we use to have fast
access to e.g. the 'current' pointer, the pid, the percpu_offset of the
current cpu, etc.

It also helps that for instructions which access memory with base register
zero, the register's contents are assumed to be zero for address generation
by the hardware, regardless of its real contents. That is, the above

        ag %r4,952

is the short version of

        ag %r4,952(%r0)

The eight bytes at offset 952 of the current CPU's prefix page are added to
register 4. The real contents of register 0 are irrelevant for such address
generation, which reduces register pressure.
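
The base register rule can be modelled in a few lines of C. This is only
a simplified sketch: effective_address() is a made-up name, and index
registers and the signed displacement of the real instruction format are
left out.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Simplified model of base + displacement address generation: base
 * register number 0 means "no base", so the contents of gpr[0] are
 * never used for address generation.
 */
uint64_t effective_address(const uint64_t gpr[16], unsigned int base,
			   uint64_t disp)
{
	uint64_t base_value;

	assert(base < 16);
	base_value = (base == 0) ? 0 : gpr[base];
	return base_value + disp;
}
```

With this model, "952" and "952(%r0)" both resolve to address 952, no
matter what register 0 currently holds.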

> Borrowed the disassemble from patch #2 commit message:
> 
>   11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
>   11a8ec:       b9 04 00 43             lgr     %r4,%r3
>   11a8f0:       eb 00 43 c0 00 52       mviy    960,4
>   11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
>   11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
>   11a902:       eb 00 03 c0 00 52       mviy    960,0
>   11a908:       b9 08 00 25             agr     %r2,%r5
>   11a90c:       07 fe                   br      %r14
> 
> 11a8f0 loads the percpu offset and mark the percpu code section begin, I
> believe this is needed with percpu page table too because we need load local
> percpu offset.

No, 11a8f0 _writes_ the base register number, which contains the percpu
address used by the percpu atomic op at 11a8fc, to offset 960 of the first
prefix page. It could also be written like

	mviy 960(%r0),4

maybe that makes it more obvious what happens. And yes, this marks the
beginning of a percpu code section. The percpu offset is added to register r4
at 11a8f6 with the ag instruction. This could also be written like

	ag %r4,952(%r0)

This reads the eight byte percpu_offset from offset 952 of the first prefix
page, and adds it to register r4.

> 11a920 loads 0 to the register to mark the percpu code section end, this is
> not needed with percpu page table.

I guess you meant 11a902. But yes, this marks the end of the percpu code
section. Just that this is not a register, but a memory location which is
written to.

> And you need to save the register at the irq/exception entry, then restore
> it at exit. But you also need to check whether migration happens or not, if
> it happens kernel needs to rewrite the register with correct percpu offset
> and needs to check whether the interrupted instruction is "ag".

Yes.

> If it is "ag" instruction (11a8f6) , kernel needs to recalculate the percpu
> address, right?

No, if it is within the percpu code section and it is _not_ the ag instruction,
the percpu base register needs to be adjusted (that's by the way a bug in
patch two, which has this logic inverted - my mistake).
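
As a rough sketch of that decision (not the code from the series): the
structure, the field names and percpu_fixup() below are all hypothetical,
and rebasing by the difference of the two percpu offsets is an assumption
about how the adjustment could be done. Saving and restoring the marker
byte at offset 960 is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical state captured when a percpu code section is interrupted. */
struct percpu_fixup_ctx {
	uint64_t gprs[16];		/* saved general purpose registers */
	uint8_t percpu_register;	/* byte from offset 960, 0 = no section */
	bool at_ag_insn;		/* interrupted instruction is the "ag" */
};

/* Returns true if the percpu base register was rebased to the new cpu. */
bool percpu_fixup(struct percpu_fixup_ctx *ctx, uint64_t old_offset,
		  uint64_t new_offset)
{
	assert(ctx->percpu_register < 16);

	if (ctx->percpu_register == 0)
		return false;	/* not within a percpu code section */
	if (old_offset == new_offset)
		return false;	/* still on the same cpu */
	if (ctx->at_ag_insn)
		return false;	/* ag did not run yet, it adds the new offset */

	/* The old cpu's offset was already added: swap it for the new one. */
	ctx->gprs[ctx->percpu_register] += new_offset - old_offset;
	return true;
}
```

The ag case needs no adjustment because the instruction has not executed
yet; when it runs after the migration it reads the percpu_offset of the
new cpu's prefix page by itself.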

> It sounds a little bit hacky to me TBH and incur some extra overhead for
> "migration detection" and fixup.

Sure, it is hacky, and the small overhead part is of course true.

With the percpu page table proposal the two mviy instructions above
would go away, as well as the extra interrupt/exception overhead. Besides
that, your proposal is way less hacky.

> > > So it seems have more overhead than the percpu page table approach IIUC.
> > > We don't need all the steps with percpu page table. And there is no
> > > penalty for migration.
> > This code looks like it relies on 'page zero' already being percpu.
> > So it probably isn't really that different.
> > Some values like the 'preemption disable count' and 'current' could be
> > (maybe are?) written into page zero to give fast access.
> 
> I don't quite get what you mean about 'page zero'.

Hopefully the above description with prefix pages explains it?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-21 10:37       ` Heiko Carstens
@ 2026-05-21 17:47         ` Yang Shi
  2026-05-22  9:18           ` Heiko Carstens
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2026-05-21 17:47 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: David Laight, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390



On 5/21/26 3:37 AM, Heiko Carstens wrote:
> On Wed, May 20, 2026 at 05:23:37PM -0700, Yang Shi wrote:
>>>> If I understand correctly, you replaced preempt_disable() and
>>>> preempt_enable() with seq begin and seq end, and seq begin and seq end
>>>> can be optimized by mviy instruction on S390. So you just need a single
>>>> mviy instruction for each instead of read-modify-write the seq count.
>>>>
>>>> But you need some extra overhead for context switch (save and restore
>>>> the seq count register) and need to check whether it is still on the
>>>> same cpu once resuming execution. And there is also penalty if it is
>>>> migrated to another CPU (need to rerun this_cpu ops).
>>> Not as I understand it.
>>> What happens is the context switch code 'corrupts' the register being
>>> used to access per-cpu data so that it is correct for the new cpu.
>>> The write of zero after the sequence is there to stop the register
>>> being corrupted outside of this code window.
>> Thanks for elaborating it. I misunderstood some nuance. I read the patch #2
>> commit message, now I think I understand how it works.
> As background: s390 has so called prefix pages; the first two pages of every
> CPU are percpu, via a special prefixing mechanism. Parts of the pages can be
> used by operating systems as percpu data area, which we use to have fast
> access to e.g. the 'current' pointer, the pid, percpu_offset of the current
> cpu, etc.
>
> Helpful is also that for instructions which access memory with a base register
> zero, its contents are assumed to be zero for address generation by the
> hardware, regardless of its real contents. That is, the above
>
>          ag %r4,952
>
> is the short version of
>
>          ag %r4,952(%r0)
>
> The eight bytes at offset 952 of the current CPU's prefix page are added to
> register 4. Real contents of register 0 are irrelevant for such address
> generations; reducing register pressure.

Aha, I see. So the prefix pages are some special memory?

>
>> Borrowed the disassemble from patch #2 commit message:
>>
>>    11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
>>    11a8ec:       b9 04 00 43             lgr     %r4,%r3
>>    11a8f0:       eb 00 43 c0 00 52       mviy    960,4
>>    11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
>>    11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
>>    11a902:       eb 00 03 c0 00 52       mviy    960,0
>>    11a908:       b9 08 00 25             agr     %r2,%r5
>>    11a90c:       07 fe                   br      %r14
>>
>> 11a8f0 loads the percpu offset and mark the percpu code section begin, I
>> believe this is needed with percpu page table too because we need load local
>> percpu offset.
> No, 11a8f0 _writes_ the base register number, which contains the percpu
> address used by the percpu atomic op at 11a8fc, to offset 960 of the first
> prefix page. It could also be written like
>
> 	mviy 960(%r0),4
>
> maybe that makes it more obvious what happens. And yes, this marks the
> beginning of a percpu code section. The percpu offset is added to register r4
> at 11a8f6 with the ag instruction. This could also be written like
>
> 	ag %r4,952(%r0)
>
> This reads the eight byte percpu_offset from offset 952 of the first prefix
> page, and adds it to register r4.

Got it.

>
>> 11a920 loads 0 to the register to mark the percpu code section end, this is
>> not needed with percpu page table.
> I guess you meant 11a902. But yes, this marks the end of the percpu code
> section. Just that this is not a register, but a memory location where is
> written to.

So both mviy instructions actually do memory store?

>
>> And you need to save the register at the irq/exception entry, then restore
>> it at exit. But you also need to check whether migration happens or not, if
>> it happens kernel needs to rewrite the register with correct percpu offset
>> and needs to check whether the interrupted instruction is "ag".
> Yes.
>
>> If it is "ag" instruction (11a8f6) , kernel needs to recalculate the percpu
>> address, right?
> No, if it is within the percpu code section and it is _not_ the ag instruction,
> the percpu base register needs to be adjusted (that's by the way a bug in
> patch two, which has this logic inverted - my mistake).

Yeah, I see.

>
>> It sounds a little bit hacky to me TBH and incur some extra overhead for
>> "migration detection" and fixup.
> Sure, it is hacky, and the small overhead part is of course true.
>
> Compared to the percpu page table proposal the two mviy instructions above
> would go away, as well as the extra interrupt/exception overhead. Besides
> that your proposal is way less hacky.

It would be great if we could compare the performance numbers for the two
approaches. rseq has been discussed for ARM64, but it seems too
expensive and just moves the overhead somewhere else.

>
>>>> So it seems have more overhead than the percpu page table approach IIUC.
>>>> We don't need all the steps with percpu page table. And there is no
>>>> penalty for migration.
>>> This code looks like it relies on 'page zero' already being percpu.
>>> So it probably isn't really that different.
>>> Some values like the 'preemption disable count' and 'current' could be
>>> (maybe are?) written into page zero to give fast access.
>> I don't quite get what you mean about 'page zero'.
> Hopefully the above description with prefix pages explains it?

Yes, definitely, thank you so much for elaborating it.

Regards,
Yang



^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-21 17:47         ` Yang Shi
@ 2026-05-22  9:18           ` Heiko Carstens
  2026-05-27 19:09             ` Christoph Lameter (Ampere)
  2026-05-27 23:44             ` Yang Shi
  0 siblings, 2 replies; 37+ messages in thread
From: Heiko Carstens @ 2026-05-22  9:18 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Laight, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390

On Thu, May 21, 2026 at 10:47:49AM -0700, Yang Shi wrote:
> > As background: s390 has so called prefix pages; the first two pages of every
> > CPU are percpu, via a special prefixing mechanism. Parts of the pages can be
> > used by operating systems as percpu data area, which we use to have fast
> > access to e.g. the 'current' pointer, the pid, percpu_offset of the current
> > cpu, etc.
> > 
> > Helpful is also that for instructions which access memory with a base register
> > zero, its contents are assumed to be zero for address generation by the
> > hardware, regardless of its real contents. That is, the above
> > 
> >          ag %r4,952
> > 
> > is the short version of
> > 
> >          ag %r4,952(%r0)
> > 
> > The eight bytes at offset 952 of the current CPU's prefix page are added to
> > register 4. Real contents of register 0 are irrelevant for such address
> > generations; reducing register pressure.
> 
> Aha, I see. So the prefix pages are some special memory?

No, it is regular memory. The CPU has a special "prefix register". If
that is set to an address not equal to zero, all memory accesses to the
first two pages will be transparently redirected to the 8k memory area
specified with that register.

E.g. the prefix register contains the value 0x10000. If a memory access
to address 0x400 then happens, the CPU will transparently turn that
into a memory access to address 0x10400. Or in other words, that is a
small per cpu memory area mechanism provided by the architecture.
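
The redirection described above can be written down as a small C model.
This is a sketch only: apply_prefix() and PREFIX_AREA_SIZE are made-up
names, the 8k alignment of the prefix value is an assumption of the
model, and any handling of accesses that target the relocated 8k area
itself is left out.

```c
#include <assert.h>
#include <stdint.h>

#define PREFIX_AREA_SIZE 8192u	/* the first two 4k pages */

/*
 * Model of the forward direction of prefixing: accesses to the first
 * 8k are redirected to the area the prefix register points to.
 */
uint64_t apply_prefix(uint64_t prefix, uint64_t addr)
{
	/* Assumption of this model: the prefix area is 8k aligned. */
	assert((prefix & (PREFIX_AREA_SIZE - 1)) == 0);

	if (prefix != 0 && addr < PREFIX_AREA_SIZE)
		return prefix + addr;
	return addr;
}
```

With the values from the example, apply_prefix(0x10000, 0x400) yields
0x10400, while an access outside the first two pages is left untouched.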

> > >    11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
> > >    11a8ec:       b9 04 00 43             lgr     %r4,%r3
> > >    11a8f0:       eb 00 43 c0 00 52       mviy    960,4
> > >    11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
> > >    11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
> > >    11a902:       eb 00 03 c0 00 52       mviy    960,0
> > >    11a908:       b9 08 00 25             agr     %r2,%r5
> > >    11a90c:       07 fe                   br      %r14

...

> > > 11a920 loads 0 to the register to mark the percpu code section end, this is
> > > not needed with percpu page table.
> > I guess you meant 11a902. But yes, this marks the end of the percpu code
> > section. Just that this is not a register, but a memory location where is
> > written to.
> 
> So both mviy instructions actually do memory store?

Yes.

> > > It sounds a little bit hacky to me TBH and incur some extra overhead for
> > > "migration detection" and fixup.
> > Sure, it is hacky, and the small overhead part is of course true.
> > 
> > Compared to the percpu page table proposal the two mviy instructions above
> > would go away, as well as the extra interrupt/exception overhead. Besides
> > that your proposal is way less hacky.
> 
> It would be great if we can compare the performance number for the two
> approaches. rseq has been discussed for ARM64, but it seems too expensive
> and just move the overhead to somewhere else.

I tried to implement the proposed rseq/kseq, but the required inline
assemblies resulted in code which was larger than what we have now for
s390.

Also with the current proposal I only did some quick micro benchmarks,
which resulted in 0-1% improvement, which is in the expected range.

It is amazing to see the performance improvements you see on arm64, however
I believe that is mainly because of the large amount of code which is
generated by the arm64 implementations of the preempt primitives
__preempt_count_add() and __preempt_count_dec_and_test().

That's a big difference to s390: for both primitives the result is a single
instruction.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-22  9:18           ` Heiko Carstens
@ 2026-05-27 19:09             ` Christoph Lameter (Ampere)
  2026-05-27 20:38               ` Yang Shi
  2026-05-27 23:44             ` Yang Shi
  1 sibling, 1 reply; 37+ messages in thread
From: Christoph Lameter (Ampere) @ 2026-05-27 19:09 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Yang Shi, David Laight, Alexander Gordeev, Sven Schnelle,
	Vasily Gorbik, Christian Borntraeger, Juergen Christ,
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390

On Fri, 22 May 2026, Heiko Carstens wrote:

> Also with the current proposal I only did some quick micro benchmarks,
> which resulted in 0-1% improvement, which is in the expected range.
>
> It is amazing to see the performance improvements you see on arm64, however
> I believe that is mainly because of the large amount of code which is
> generated by the arm64 implementations of the preempt primitives
> __preempt_count_add() and __preempt_count_dec_and_test().

The code is generated if you have no arch-specific per-cpu mechanism and
preemption must be supported. We now have the situation that we cannot
switch off preemption support anymore.

It seems that S390 has this mechanism in a small way and therefore can
avoid the preempt enable/disable.

It is not the quantity of code here. The preempt enable/disable can only
be avoided if there is a single instruction doing the per-cpu operation. A
single instruction cannot be interrupted and therefore is preemption safe.

> That's a big difference to s390: for both primitives the result is a single
> instruction.

Ok then you can already use single instructions like x86 and will not have
preempt enable/disable overhead.

I am not sure what David Laight's code is supposed to do. Seems weird to
me.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-27 19:09             ` Christoph Lameter (Ampere)
@ 2026-05-27 20:38               ` Yang Shi
  2026-05-28  8:36                 ` David Laight
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2026-05-27 20:38 UTC (permalink / raw)
  To: Christoph Lameter (Ampere), Heiko Carstens
  Cc: David Laight, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Peter Zijlstra,
	Shrikanth Hegde, linux-kernel, linux-s390



On 5/27/26 12:09 PM, Christoph Lameter (Ampere) wrote:
> On Fri, 22 May 2026, Heiko Carstens wrote:
>
>> Also with the current proposal I only did some quick micro benchmarks,
>> which resulted in 0-1% improvement, which is in the expected range.
>>
>> It is amazing to see the performance improvements you see on arm64, however
>> I believe that is mainly because of the large amount of code which is
>> generated by the arm64 implementations of the preempt primitives
>> __preempt_count_add() and __preempt_count_dec_and_test().
> The code is generated if you have no arch specific per cpu mechanism and
> preemption must be supported. We have now the situation that we cannot
> switch off preemption support anymore.
>
> It seems that S390 has this mechanism in a small way and therefore can
> avoid the preempt enable/disable.
>
> It is not the quantity of code here. The preempt enable/disable can only
> be avoided if there is a single instruction doing the per cpu operation. A
> single instruction cannot be interrupted and therefore is preemption safe.
>
>
>> That's a big difference to s390: for both primitives the result is a single
>> instruction.
> Ok then you can already use single instructions like x86 and will not have
> preempt enable/disable overhead.

I don't think S390 can do it in one single instruction. IIUC, Heiko 
means preempt_enable/disable is a single instruction on s390, but it is 
RMW on ARM64 (3 instructions for each).

Thanks,
Yang

>
> I am not sure what David Laight's code is supposed to do. Seems weird to
> me.
>


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-27 20:38               ` Yang Shi
@ 2026-05-28  8:36                 ` David Laight
  0 siblings, 0 replies; 37+ messages in thread
From: David Laight @ 2026-05-28  8:36 UTC (permalink / raw)
  To: Yang Shi
  Cc: Christoph Lameter (Ampere), Heiko Carstens, Alexander Gordeev,
	Sven Schnelle, Vasily Gorbik, Christian Borntraeger,
	Juergen Christ, Peter Zijlstra, Shrikanth Hegde, linux-kernel,
	linux-s390

On Wed, 27 May 2026 13:38:06 -0700
Yang Shi <yang@os.amperecomputing.com> wrote:

> On 5/27/26 12:09 PM, Christoph Lameter (Ampere) wrote:
> > On Fri, 22 May 2026, Heiko Carstens wrote:
> >  
> >> Also with the current proposal I only did some quick micro benchmarks,
> >> which resulted in 0-1% improvement, which is in the expected range.
> >>
> >> It is amazing to see the performance improvements you see on arm64, however
> >> I believe that is mainly because of the large amount of code which is
> >> generated by the arm64 implementations of the preempt primitives
> >> __preempt_count_add() and __preempt_count_dec_and_test().  
> > The code is generated if you have no arch specific per cpu mechanism and
> > preemption must be supported. We have now the situation that we cannot
> > switch off preemption support anymore.
> >
> > It seems that S390 has this mechanism in a small way and therefore can
> > avoid the preempt enable/disable.
> >
> > It is not the quantity of code here. The preempt enable/disable can only
> > be avoided if there is a single instruction doing the per cpu operation. A
> > single instruction cannot be interrupted and therefore is preemption safe.
> >
> >  
> >> That's a big difference to s390: for both primitives the result is a single
> >> instruction.  
> > Ok then you can already use single instructions like x86 and will not have
> > preempt enable/disable overhead.  
> 
> I don't think S390 can do it in one single instruction. IIUC, Heiko 
> means preempt_enable/disable is a single instruction on s390, but it is 
> RMW on ARM64 (3 instructions for each).

The proposed 'trick' for s390 is a sort of temporary global register
that accesses the per-cpu data.
s390 seems to have it relatively easy because of the 8k of per-cpu data
and the atomic add/and/or with memory.
x86 has two global registers (%fs and %gs) as well as atomic add.

But AFAICT arm64 (and probably others) has nothing that helps.
Allocating a global register for the per-cpu data has been suggested.
Using the mmu to generate a page of cpu-private data would make
the preempt primitives much cheaper without the difficulties of
having two addresses for per-cpu data and any related cache issues.

> 
> Thanks,
> Yang
> 
> >
> > I am not sure what David Laight's code is supposed to do. Seems weird to
> > me.

I've forgotten what I suggested, was probably broken...

-- David

> >  
> 


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-22  9:18           ` Heiko Carstens
  2026-05-27 19:09             ` Christoph Lameter (Ampere)
@ 2026-05-27 23:44             ` Yang Shi
  2026-05-28  9:03               ` David Laight
  2026-05-28 14:14               ` Heiko Carstens
  1 sibling, 2 replies; 37+ messages in thread
From: Yang Shi @ 2026-05-27 23:44 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: David Laight, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390



On 5/22/26 2:18 AM, Heiko Carstens wrote:
> On Thu, May 21, 2026 at 10:47:49AM -0700, Yang Shi wrote:
>>> As background: s390 has so called prefix pages; the first two pages of every
>>> CPU are percpu, via a special prefixing mechanism. Parts of the pages can be
>>> used by operating systems as percpu data area, which we use to have fast
>>> access to e.g. the 'current' pointer, the pid, percpu_offset of the current
>>> cpu, etc.
>>>
>>> Helpful is also that for instructions which access memory with a base register
>>> zero, its contents are assumed to be zero for address generation by the
>>> hardware, regardless of its real contents. That is, the above
>>>
>>>           ag %r4,952
>>>
>>> is the short version of
>>>
>>>           ag %r4,952(%r0)
>>>
>>> The eight bytes at offset 952 of the current CPU's prefix page are added to
>>> register 4. Real contents of register 0 are irrelevant for such address
>>> generations; reducing register pressure.
>> Aha, I see. So the prefix pages are some special memory?
> No, it is regular memory. The CPU has a special "prefix register". If
> that is set to an address not equal to zero all memory accesses to the
> first two pages will be transparently redirected to the 8k memory area
> specified with that register.
>
> E.g. the prefix register contains the value 0x10000. If then a memory
> access to address 0x400 happens the CPU will transparently turn that
> into a memory access to address 0x10400. Or in other words, that is a
> small per cpu memory area mechanism provided by the architecture.

Got it.

>>>>     11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
>>>>     11a8ec:       b9 04 00 43             lgr     %r4,%r3
>>>>     11a8f0:       eb 00 43 c0 00 52       mviy    960,4
>>>>     11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
>>>>     11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
>>>>     11a902:       eb 00 03 c0 00 52       mviy    960,0
>>>>     11a908:       b9 08 00 25             agr     %r2,%r5
>>>>     11a90c:       07 fe                   br      %r14
> ...
>
>>>> 11a920 loads 0 to the register to mark the percpu code section end, this is
>>>> not needed with percpu page table.
>>> I guess you meant 11a902. But yes, this marks the end of the percpu code
>>> section. Just that this is not a register, but a memory location where is
>>> written to.
>> So both mviy instructions actually do memory store?
> Yes.
>
>>>> It sounds a little bit hacky to me TBH and incur some extra overhead for
>>>> "migration detection" and fixup.
>>> Sure, it is hacky, and the small overhead part is of course true.
>>>
>>> Compared to the percpu page table proposal the two mviy instructions above
>>> would go away, as well as the extra interrupt/exception overhead. Besides
>>> that your proposal is way less hacky.
>> It would be great if we can compare the performance number for the two
>> approaches. rseq has been discussed for ARM64, but it seems too expensive
>> and just move the overhead to somewhere else.
> I tried to implement the proposed rseq/kseq, but the required inline
> assemblies resulted in code which was larger than what we have now for
> s390.
>
> Also with the current proposal I only did some quick micro benchmarks,
> which resulted in 0-1% improvement, which is in the expected range.
>
> It is amazing to see the performance improvements you see on arm64, however
> I believe that is mainly because of the large amount of code which is
> generated by the arm64 implementations of the preempt primitives
> __preempt_count_add() and __preempt_count_dec_and_test().

Yes, we need 4 instructions on ARM64 for disabling/enabling preempt (one
instruction is used to load the current pointer, the other 3 instructions
are used to RMW preempt_count). So I can remove 8 instructions in total
for a single this_cpu op. That's a lot. Given that this_cpu ops are heavily
used in the kernel, we end up running fewer instructions and getting a
better icache hit rate. The better icache hit rate also helps reduce
cross-node traffic on 2-socket systems.

> That's a big difference to s390: for both primitives the result is a single
> instruction.

Yeah, I see. s390 should see similar benefits in theory, but the
gains may not be as significant.

Thanks,
Yang





* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-27 23:44             ` Yang Shi
@ 2026-05-28  9:03               ` David Laight
  2026-05-28 19:19                 ` Yang Shi
  2026-05-28 14:14               ` Heiko Carstens
  1 sibling, 1 reply; 37+ messages in thread
From: David Laight @ 2026-05-28  9:03 UTC (permalink / raw)
  To: Yang Shi
  Cc: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390

On Wed, 27 May 2026 16:44:31 -0700
Yang Shi <yang@os.amperecomputing.com> wrote:

> On 5/22/26 2:18 AM, Heiko Carstens wrote:
...
> > It is amazing to see the performance improvements you see on arm64, however
> > I believe that is mainly because of the large amount of code which is
> > generated by the arm64 implementations of the preempt primitives
> > __preempt_count_add() and __preempt_count_dec_and_test().  
> 
> Yes, we need 4 instructions on ARM64 for disabling/enabling preempt (one 
> instruction is used to load current pointer, the other 3 instructions 
> are used to RMW preempt_count). So I can remove 8 instructions in total 
> for a single this_cpu ops. That's a lot. Given this_cpu ops are heavily 
> used in kernel, we end up running fewer instructions and having better 
> icache hit rate, the better icache hit rate also helps reduce cross node 
> traffic for 2-socket system.

Is 'current' kept in a cpu hardware register?
With the process switch code updating current->per_cpu_data.

That might mean that you can access per-cpu data without disabling
preemption (for single ops) using the same technique as s390.
So something like:
	mov %ra, current
	movb per_cpu_reg(%ra), $b
	mov %rb, per_cpu_data(%ra)
	// per-cpu access using %rb, process switch code will update %rb
	movb per_cpu_reg(%ra), $255

An add will need to use a cmpxchg loop.
For simplicity use a fixed register for %rb.

-- David


* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-28  9:03               ` David Laight
@ 2026-05-28 19:19                 ` Yang Shi
  2026-05-28 20:34                   ` David Laight
  0 siblings, 1 reply; 37+ messages in thread
From: Yang Shi @ 2026-05-28 19:19 UTC (permalink / raw)
  To: David Laight
  Cc: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390



On 5/28/26 2:03 AM, David Laight wrote:
> On Wed, 27 May 2026 16:44:31 -0700
> Yang Shi <yang@os.amperecomputing.com> wrote:
>
>> On 5/22/26 2:18 AM, Heiko Carstens wrote:
> ...
>>> It is amazing to see the performance improvements you see on arm64, however
>>> I believe that is mainly because of the large amount of code which is
>>> generated by the arm64 implementations of the preempt primitives
>>> __preempt_count_add() and __preempt_count_dec_and_test().
>> Yes, we need 4 instructions on ARM64 for disabling/enabling preempt (one
>> instruction is used to load current pointer, the other 3 instructions
>> are used to RMW preempt_count). So I can remove 8 instructions in total
>> for a single this_cpu ops. That's a lot. Given this_cpu ops are heavily
>> used in kernel, we end up running fewer instructions and having better
>> icache hit rate, the better icache hit rate also helps reduce cross node
>> traffic for 2-socket system.
> Is 'current' kept in a cpu hardware register?

Yes, sp_el0. But it is a special register; we need to move it to a general
register before any ARM64 instructions can access it.

> With the process switch code updating current->per_cpu_data.
>
> That might mean that you can access per-cpu data without disabling
> preemption (for single ops) using the same technique as s390.
> So something like:
> 	mov %ra, current
> 	movb per_cpu_reg(%ra), $b
> 	mov %rb, per_cpu_data(%ra)
> 	// per-cpu access using %rb, process switch code will update %rb
> 	movb per_cpu_reg(%ra), $255
>
> An add will need to use a cmpxchg loop.
> For simplicity use a fixed register for %rb.

TBH, I can't say I fully understand what you proposed. But it sounds 
like this one 
https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?id=84ee5f23f93d4a650e828f831da9ed29c54623c5

Thanks,
Yang

>
> -- David



* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-28 19:19                 ` Yang Shi
@ 2026-05-28 20:34                   ` David Laight
  0 siblings, 0 replies; 37+ messages in thread
From: David Laight @ 2026-05-28 20:34 UTC (permalink / raw)
  To: Yang Shi
  Cc: Heiko Carstens, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390

On Thu, 28 May 2026 12:19:43 -0700
Yang Shi <yang@os.amperecomputing.com> wrote:

> On 5/28/26 2:03 AM, David Laight wrote:
> > On Wed, 27 May 2026 16:44:31 -0700
> > Yang Shi <yang@os.amperecomputing.com> wrote:
> >  
> >> On 5/22/26 2:18 AM, Heiko Carstens wrote:  
> > ...  
> >>> It is amazing to see the performance improvements you see on arm64, however
> >>> I believe that is mainly because of the large amount of code which is
> >>> generated by the arm64 implementations of the preempt primitives
> >>> __preempt_count_add() and __preempt_count_dec_and_test().  
> >> Yes, we need 4 instructions on ARM64 for disabling/enabling preempt (one
> >> instruction is used to load current pointer, the other 3 instructions
> >> are used to RMW preempt_count). So I can remove 8 instructions in total
> >> for a single this_cpu ops. That's a lot. Given this_cpu ops are heavily
> >> used in kernel, we end up running fewer instructions and having better
> >> icache hit rate, the better icache hit rate also helps reduce cross node
> >> traffic for 2-socket system.  
> > Is 'current' kept in a cpu hardware register?  
> 
> Yes, sp_el0. But it is a special register, we need move it to a general 
> register before any ARM64 instructions can access it.

That is what I thought.
(Hmm... isn't that the userspace stack register?)

> 
> > With the process switch code updating current->per_cpu_data.
> >
> > That might mean that you can access per-cpu data without disabling
> > preemption (for single ops) using the same technique as s390.
> > So something like:
> > 	mov %ra, current
> > 	movb per_cpu_reg(%ra), $b
> > 	mov %rb, per_cpu_data(%ra)
> > 	// per-cpu access using %rb, process switch code will update %rb
> > 	movb per_cpu_reg(%ra), $255
> >
> > An add will need to use a cmpxchg loop.
> > For simplicity use a fixed register for %rb.  
> 
> TBH, I can't say I fully understand what you proposed. But it sounds 
> like this one 
> https://git.kernel.org/pub/scm/linux/kernel/git/mark/linux.git/commit/?id=84ee5f23f93d4a650e828f831da9ed29c54623c5

Not really, although it does describe one way to do an atomic add.
For things like per-cpu stats you don't really care if the
'wrong' stats are changed, but the R and W (of the RMW) need to go to the
same address.

That proposal reserved a 'general register' for the per-cpu data all the time.

Like the s390 code this all started with, I'm suggesting that the code
tells the context switch code that a specific register contains the base
of the per-cpu data, on context switch that register is changed to be the
base address of the per-cpu data for the new cpu.
So outside of the code accessing per-cpu data the register can be used normally.

I don't think you need to look at the opcode in the process switch (the s390
code did), even checking that %rb (above) contains the per-cpu data address
is really optional.

I suggested using a fixed register meaning 'always use the same register'
to save the difficulty of generating $n from %rn.

-- David
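
The register hand-off described in this message can be modelled in plain C.
This is only a rough userspace sketch, not kernel code: struct task_model,
NO_PERCPU_REG and the three helpers are hypothetical names, and the saved
register array stands in for whatever the context switch code would really
save and restore.

```c
#include <assert.h>
#include <stdint.h>

#define NR_REGS        8
#define NO_PERCPU_REG  255	/* matches the "$255" end marker in the pseudo asm */

/* Hypothetical model of the per-task state the scheme needs. */
struct task_model {
	uint8_t   per_cpu_reg;		/* register announced as per-cpu base, or NO_PERCPU_REG */
	uintptr_t regs[NR_REGS];	/* saved general registers of the task */
};

/* Enter the section: announce the register, then load the per-cpu base. */
static inline void percpu_section_begin(struct task_model *t, uint8_t reg,
					uintptr_t base)
{
	assert(reg < NR_REGS);
	t->per_cpu_reg = reg;
	t->regs[reg] = base;
}

/* Leave the section: the register is an ordinary register again. */
static inline void percpu_section_end(struct task_model *t)
{
	t->per_cpu_reg = NO_PERCPU_REG;
}

/*
 * Context switch hook: if the task announced a register, rewrite it with
 * the per-cpu base of the CPU the task is about to run on.
 */
static inline void switch_to_cpu(struct task_model *t, uintptr_t new_base)
{
	if (t->per_cpu_reg != NO_PERCPU_REG)
		t->regs[t->per_cpu_reg] = new_base;
}
```

Outside the begin/end pair the hook leaves every register untouched, which
is the property that lets the register be used normally elsewhere.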



 





* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-27 23:44             ` Yang Shi
  2026-05-28  9:03               ` David Laight
@ 2026-05-28 14:14               ` Heiko Carstens
  2026-05-28 17:14                 ` David Laight
  2026-05-28 18:39                 ` Yang Shi
  1 sibling, 2 replies; 37+ messages in thread
From: Heiko Carstens @ 2026-05-28 14:14 UTC (permalink / raw)
  To: Yang Shi
  Cc: David Laight, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390

On Wed, May 27, 2026 at 04:44:31PM -0700, Yang Shi wrote:
> On 5/22/26 2:18 AM, Heiko Carstens wrote:
> > It is amazing to see the performance improvements you see on arm64, however
> > I believe that is mainly because of the large amount of code which is
> > generated by the arm64 implementations of the preempt primitives
> > __preempt_count_add() and __preempt_count_dec_and_test().
> 
> Yes, we need 4 instructions on ARM64 for disabling/enabling preempt (one
> instruction is used to load current pointer, the other 3 instructions are
> used to RMW preempt_count). So I can remove 8 instructions in total for a
> single this_cpu ops. That's a lot. Given this_cpu ops are heavily used in
> kernel, we end up running fewer instructions and having better icache hit
> rate, the better icache hit rate also helps reduce cross node traffic for
> 2-socket system.

You save more. Look at arm64's __preempt_count_dec_and_test()
implementation: it is RMW + compare + READ + compare.

preempt_enable() generates this code, where x1 seems to contain the
preempt_count pointer:

  80:   f9400420        ldr     x0, [x1, #8]
  84:   d1000400        sub     x0, x0, #0x1
  88:   b9000820        str     w0, [x1, #8]
  8c:   b4000060        cbz     x0, 98 <bar+0x58>
  90:   f9400420        ldr     x0, [x1, #8]
  94:   b5000040        cbnz    x0, 9c <bar+0x5c>
  98:   94000000        bl      0 <preempt_schedule_notrace>
  9c:   ...

I assume arm64's instruction set does not allow for better code for
__preempt_count_dec_and_test() if you would fold the need_resched bit into
preempt_count and use atomic instructions + inline assembly with flag
output operands when modifying preempt_count.
As of now only x86 and s390 are doing that.
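
For illustration, the folding idea mentioned above can be modelled in
userspace C with C11 atomics. This is a rough sketch under assumptions, not
the x86 or s390 kernel code: the names are hypothetical, and bit 31 is
assumed to hold an inverted need_resched flag, so that the whole word is
zero exactly when the count is zero and a reschedule is pending.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: bit 31 set means "no reschedule needed" (inverted flag). */
#define NEED_RESCHED_INV 0x80000000u

/* Starts with count 0 and no reschedule pending. */
static _Atomic uint32_t model_preempt_count = NEED_RESCHED_INV;

static inline void model_preempt_disable(void)
{
	atomic_fetch_add_explicit(&model_preempt_count, 1, memory_order_relaxed);
}

static inline void model_set_need_resched(void)
{
	/* Clearing the inverted bit marks a reschedule as pending. */
	atomic_fetch_and_explicit(&model_preempt_count, ~NEED_RESCHED_INV,
				  memory_order_relaxed);
}

static inline void model_clear_need_resched(void)
{
	atomic_fetch_or_explicit(&model_preempt_count, NEED_RESCHED_INV,
				 memory_order_relaxed);
}

/*
 * One atomic decrement answers both questions at once: the old value was 1
 * only if the count was 1 and the inverted flag was clear, so the new value
 * is zero and the caller should reschedule.
 */
static inline bool model_preempt_count_dec_and_test(void)
{
	return atomic_fetch_sub_explicit(&model_preempt_count, 1,
					 memory_order_relaxed) == 1;
}
```

With this layout no second read of the flag is needed after the decrement.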


* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-28 14:14               ` Heiko Carstens
@ 2026-05-28 17:14                 ` David Laight
  2026-05-28 18:39                 ` Yang Shi
  1 sibling, 0 replies; 37+ messages in thread
From: David Laight @ 2026-05-28 17:14 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Yang Shi, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390

On Thu, 28 May 2026 16:14:41 +0200
Heiko Carstens <hca@linux.ibm.com> wrote:

> On Wed, May 27, 2026 at 04:44:31PM -0700, Yang Shi wrote:
> > On 5/22/26 2:18 AM, Heiko Carstens wrote:  
> > > It is amazing to see the performance improvements you see on arm64, however
> > > I believe that is mainly because of the large amount of code which is
> > > generated by the arm64 implementations of the preempt primitives
> > > __preempt_count_add() and __preempt_count_dec_and_test().  
> > 
> > Yes, we need 4 instructions on ARM64 for disabling/enabling preempt (one
> > instruction is used to load current pointer, the other 3 instructions are
> > used to RMW preempt_count). So I can remove 8 instructions in total for a
> > single this_cpu ops. That's a lot. Given this_cpu ops are heavily used in
> > kernel, we end up running fewer instructions and having better icache hit
> > rate, the better icache hit rate also helps reduce cross node traffic for
> > 2-socket system.  
> 
> You save more. Look at arm64's __preempt_count_dec_and_test()
> implementation: it is RMW + compare + READ + compare.
> 
> preempt_enable() generates this code, where x1 seems to contain the
> preempt_count pointer:
> 
>   80:   f9400420        ldr     x0, [x1, #8]
>   84:   d1000400        sub     x0, x0, #0x1
>   88:   b9000820        str     w0, [x1, #8]
>   8c:   b4000060        cbz     x0, 98 <bar+0x58>
>   90:   f9400420        ldr     x0, [x1, #8]
>   94:   b5000040        cbnz    x0, 9c <bar+0x5c>
>   98:   94000000        bl      0 <preempt_schedule_notrace>
>   9c:   ...
> 
> I assume arm64's instruction set does not allow for better code for
> __preempt_count_dec_and_test() if you would fold the need_resched bit into
> preempt_count and use atomic instructions + inline assembly with flag
> output operands when modifying preempt_count.
> As of now only x86 and s390 are doing that.

I think arm64 only has single instruction exchanges - which makes life hard.
But it has to be possible to do better than the above.
The 'normal' path (not nested, no preemption) seems to execute everything
except the 'bl'.
All the 'not preempted' paths have a taken forwards conditional branch
that stands a fair chance of being mispredicted.
There is also the 32bit write followed by a 64bit read of the same address.
That will 'break' any logic that does 'store to load' forwarding (where
the read is satisfied from the store buffer) and add more delays.
That means I think you need something like:
	ldr	w0, [x1, #8]
	sub	x0, x0, #1
	str	w0, [x1, #8]
	ldr	w2, [x1, #12]
	orr	x0, x0, x2
	cbz	x0, 1f
2:
# sometime later.
1:
	bl	preempt_schedule
	b	2b

But the last arm system I wrote asm for was a strongarm!
And the book I have is from 2004.

The definition:
#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0) 
doesn't really help.
gcc tends to ignore the unlikely() when the other path is empty
and just generates a forwards branch around the call.
Forcing it to generate both parts of the if can help.
So:
#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
	else \
		asm (""); \
} while (0)
can be enough to force a conditional branch to the call.

-- David




* Re: [PATCH v3 0/9] s390: Improve this_cpu operations
  2026-05-28 14:14               ` Heiko Carstens
  2026-05-28 17:14                 ` David Laight
@ 2026-05-28 18:39                 ` Yang Shi
  1 sibling, 0 replies; 37+ messages in thread
From: Yang Shi @ 2026-05-28 18:39 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: David Laight, Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, Christoph Lameter (Ampere),
	Peter Zijlstra, Shrikanth Hegde, linux-kernel, linux-s390



On 5/28/26 7:14 AM, Heiko Carstens wrote:
> On Wed, May 27, 2026 at 04:44:31PM -0700, Yang Shi wrote:
>> On 5/22/26 2:18 AM, Heiko Carstens wrote:
>>> It is amazing to see the performance improvements you see on arm64, however
>>> I believe that is mainly because of the large amount of code which is
>>> generated by the arm64 implementations of the preempt primitives
>>> __preempt_count_add() and __preempt_count_dec_and_test().
>> Yes, we need 4 instructions on ARM64 for disabling/enabling preempt (one
>> instruction is used to load current pointer, the other 3 instructions are
>> used to RMW preempt_count). So I can remove 8 instructions in total for a
>> single this_cpu ops. That's a lot. Given this_cpu ops are heavily used in
>> kernel, we end up running fewer instructions and having better icache hit
>> rate, the better icache hit rate also helps reduce cross node traffic for
>> 2-socket system.
> You save more. Look at arm64's __preempt_count_dec_and_test()
> implementation: it is RMW + compare + READ + compare.

Yes

>
> preempt_enable() generates this code, where x1 seems to contain the
> preempt_count pointer:
>
>    80:   f9400420        ldr     x0, [x1, #8]
>    84:   d1000400        sub     x0, x0, #0x1
>    88:   b9000820        str     w0, [x1, #8]
>    8c:   b4000060        cbz     x0, 98 <bar+0x58>
>    90:   f9400420        ldr     x0, [x1, #8]
>    94:   b5000040        cbnz    x0, 9c <bar+0x5c>
>    98:   94000000        bl      0 <preempt_schedule_notrace>
>    9c:   ...
>
> I assume arm64's instruction set does not allow for better code for
> __preempt_count_dec_and_test() if you would fold the need_resched bit into
> preempt_count and use atomic instructions + inline assembly with flag
> output operands when modifying preempt_count.
> As of now only x86 and s390 are doing that.

preempt_count and need_resched share the same 8 bytes. preempt_count is 
the lower 32 bits, need_resched is the upper 32 bits.

Atomic instruction is usually slower than load + add + store on ARM64 if 
the cache line is not contended. We may save one branch + load, but my 
profiling didn't show branch is a major contributing factor. The 
performance gain mainly comes from fewer instructions and icache hit 
rate improvement due to the elimination of preempt_disable/enable.

Thanks,
Yang
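
The shared 8-byte layout and the "RMW + compare + READ + compare" sequence
discussed in this subthread can be modelled in userspace C. This is a rough
sketch, not the arm64 implementation: struct ti_model and the helpers are
hypothetical, masks are used instead of a union to stay endian-neutral, and
the upper half is assumed to be an inverted flag (0 means a reschedule is
needed), which is what the cbz/cbnz pair in the disassembly suggests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define COUNT_MASK 0xffffffffull

/* Low 32 bits: preempt count. High 32 bits: inverted need_resched flag. */
struct ti_model {
	uint64_t preempt_count;
};

/* Models the 32-bit store that updates only the count half. */
static inline void ti_store_count(struct ti_model *ti, uint32_t count)
{
	ti->preempt_count = (ti->preempt_count & ~COUNT_MASK) | count;
}

/* Assumed encoding: upper half is 0 when a reschedule is needed. */
static inline void ti_set_need_resched(struct ti_model *ti, bool needed)
{
	uint64_t flag = needed ? 0 : (1ull << 32);

	ti->preempt_count = (ti->preempt_count & COUNT_MASK) | flag;
}

static inline bool ti_dec_and_test(struct ti_model *ti)
{
	uint64_t pc = ti->preempt_count;	/* 64-bit load of both halves */

	pc--;
	ti_store_count(ti, (uint32_t)pc);	/* 32-bit store of the count */

	/*
	 * All zero means count 0 and reschedule needed. The reload only
	 * matters if the flag changed between the load and the store, which
	 * cannot happen in this single-threaded model.
	 */
	return !pc || !ti->preempt_count;
}
```

The second read in the return statement corresponds to the extra ldr/cbnz
pair in the disassembly quoted above.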




Thread overview: 37+ messages
2026-05-20  9:22 [PATCH v3 0/9] s390: Improve this_cpu operations Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 1/9] s390/alternatives: Add new ALT_TYPE_PERCPU type Heiko Carstens
2026-05-20 12:43   ` David Laight
2026-05-20 13:50     ` Heiko Carstens
2026-05-20 14:16       ` Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 2/9] s390/percpu: Infrastructure for more efficient this_cpu operations Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 3/9] s390/percpu: Add missing do { } while (0) constructs Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 4/9] s390/percpu: Use new percpu code section for arch_this_cpu_add() Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 5/9] s390/percpu: Use new percpu code section for arch_this_cpu_add_return() Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 6/9] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]() Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 7/9] s390/percpu: Provide arch_this_cpu_read() implementation Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 8/9] s390/percpu: Provide arch_this_cpu_write() implementation Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 9/9] s390/percpu: Remove one and two byte this_cpu operation implementation Heiko Carstens
2026-05-20 18:42 ` [PATCH v3 0/9] s390: Improve this_cpu operations Yang Shi
2026-05-20 22:34   ` David Laight
2026-05-21  0:23     ` Yang Shi
2026-05-21 10:17       ` David Laight
2026-05-21 16:57         ` Yang Shi
2026-05-21 17:55           ` David Laight
2026-05-21 20:46             ` Yang Shi
2026-05-21 22:13               ` David Laight
2026-05-21 23:41                 ` Yang Shi
2026-05-21 10:23       ` David Laight
2026-05-21 17:48         ` Yang Shi
2026-05-21 10:37       ` Heiko Carstens
2026-05-21 17:47         ` Yang Shi
2026-05-22  9:18           ` Heiko Carstens
2026-05-27 19:09             ` Christoph Lameter (Ampere)
2026-05-27 20:38               ` Yang Shi
2026-05-28  8:36                 ` David Laight
2026-05-27 23:44             ` Yang Shi
2026-05-28  9:03               ` David Laight
2026-05-28 19:19                 ` Yang Shi
2026-05-28 20:34                   ` David Laight
2026-05-28 14:14               ` Heiko Carstens
2026-05-28 17:14                 ` David Laight
2026-05-28 18:39                 ` Yang Shi
