public inbox for linux-s390@vger.kernel.org
* [PATCH v2 0/9] s390: Improve this_cpu operations
@ 2026-03-19 12:04 Heiko Carstens
  2026-03-19 12:04 ` [PATCH v2 1/9] s390/percpu: Provide arch_raw_cpu_ptr() Heiko Carstens
                   ` (9 more replies)
  0 siblings, 10 replies; 13+ messages in thread
From: Heiko Carstens @ 2026-03-19 12:04 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, linux-kernel, linux-s390

v2:

- Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
  warnings

- Add missing __packed attribute to insn structure [Sashiko [2]]

- Fix inverted if condition [Sashiko [2]]

- Add missing user_mode() check [Sashiko [2]]

- Move percpu_entry() call in front of irqentry_enter() call in all
  entry paths so that potential this_cpu() operations cannot overwrite
  the not-yet-saved percpu code section indicator  [Sashiko [2]]

[2] https://sashiko.dev/#/patchset/20260317195436.2276810-1-hca%40linux.ibm.com

v1:

This is a follow-up to Peter Zijlstra's in-kernel rseq RFC [1].

With the intended removal of PREEMPT_NONE, this_cpu operations based on
atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
become more expensive: the preempt_disable()/preempt_enable() pairs are
no longer optimized away at compile time.

In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.

To avoid this Peter suggested an in-kernel rseq approach. While this would
certainly work, this series tries to come up with a solution which uses
fewer instructions and doesn't require repeating instruction sequences.

The idea is that this_cpu operations based on atomic instructions are
guarded with mviy instructions:

- The first mviy instruction writes the number of the register which
  contains the percpu address to lowcore. This also indicates that a
  percpu code section is being executed.

- The first instruction following the mviy instruction must be the ag
  instruction which adds the percpu offset to the percpu address register.

- Afterwards the atomic percpu operation follows.

- Then a second mviy instruction writes a zero to lowcore, which indicates
  the end of the percpu code section.

- In case of an interrupt/exception/nmi the register number which was
  written to lowcore is copied to the exception frame (pt_regs), and a zero
  is written to lowcore.

- On return to the previous context it is checked whether a percpu code
  section was executed (saved register number not zero), and whether the
  process was migrated to a different cpu. If the percpu offset was
  already added to the percpu address register (instruction address does
  _not_ point to the ag instruction) the content of the percpu address
  register is adjusted so it points to the percpu variable of the new cpu.

All of this seems to work, but of course it could still be broken since I
missed some detail.

In total this series results in a kernel text size reduction of ~106kb. The
number of preempt_schedule_notrace() call sites is reduced from 7089 to
1577.

Note: this comes without any extensive performance analysis; however, all
microbenchmarks confirmed that the new code is at least as fast as the old
code, as expected.

[1] 20260223163843.GR1282955@noisy.programming.kicks-ass.net

Heiko Carstens (9):
  s390/percpu: Provide arch_raw_cpu_ptr()
  s390/alternatives: Add new ALT_TYPE_PERCPU type
  s390/percpu: Infrastructure for more efficient this_cpu operations
  s390/percpu: Use new percpu code section for arch_this_cpu_add()
  s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
  s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
  s390/percpu: Provide arch_this_cpu_read() implementation
  s390/percpu: Provide arch_this_cpu_write() implementation
  s390/percpu: Remove one and two byte this_cpu operation implementation

 arch/s390/boot/alternative.c         |   7 +
 arch/s390/include/asm/alternative.h  |   5 +
 arch/s390/include/asm/entry-percpu.h |  57 ++++++
 arch/s390/include/asm/lowcore.h      |   3 +-
 arch/s390/include/asm/percpu.h       | 259 ++++++++++++++++++++++-----
 arch/s390/include/asm/ptrace.h       |   2 +
 arch/s390/kernel/alternative.c       |  25 ++-
 arch/s390/kernel/irq.c               |  17 +-
 arch/s390/kernel/nmi.c               |   3 +
 arch/s390/kernel/traps.c             |   3 +
 10 files changed, 330 insertions(+), 51 deletions(-)
 create mode 100644 arch/s390/include/asm/entry-percpu.h

-- 
2.51.0


^ permalink raw reply	[flat|nested] 13+ messages in thread

* [PATCH v2 1/9] s390/percpu: Provide arch_raw_cpu_ptr()
  2026-03-19 12:04 [PATCH v2 0/9] s390: Improve this_cpu operations Heiko Carstens
@ 2026-03-19 12:04 ` Heiko Carstens
  2026-03-19 12:04 ` [PATCH v2 2/9] s390/alternatives: Add new ALT_TYPE_PERCPU type Heiko Carstens
                   ` (8 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Heiko Carstens @ 2026-03-19 12:04 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, linux-kernel, linux-s390

Provide an s390 specific arch_raw_cpu_ptr() implementation which avoids the
detour via get_lowcore() to get the lowcore pointer. The inline assembly is
implemented with an alternative so that a relocated lowcore (where the
percpu offset is at a different address) is handled correctly.

This turns code like this

  102f78:       a7 39 00 00             lghi    %r3,0
  102f7c:       e3 20 33 b8 00 08       ag      %r2,952(%r3)

which adds the percpu offset to register r2 into a single instruction

  102f7c:       e3 20 33 b8 00 08       ag      %r2,952(%r0)

and also avoids the need for a base register, thus reducing register
pressure.

With defconfig bloat-o-meter -t provides this result:

add/remove: 12/26 grow/shrink: 183/3391 up/down: 14880/-41950 (-27070)

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 5899f57f17d1..b18a96f3a334 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -12,6 +12,24 @@
  */
 #define __my_cpu_offset get_lowcore()->percpu_offset
 
+#define arch_raw_cpu_ptr(_ptr)						\
+({									\
+	unsigned long lc_percpu, tcp_ptr__;				\
+									\
+	tcp_ptr__ = (__force unsigned long)(_ptr);			\
+	lc_percpu = offsetof(struct lowcore, percpu_offset);		\
+	asm_inline volatile(						\
+	ALTERNATIVE("ag		%[__ptr__],%[offzero](%%r0)\n",		\
+		    "ag		%[__ptr__],%[offalt](%%r0)\n",		\
+		    ALT_FEATURE(MFEATURE_LOWCORE))			\
+	: [__ptr__] "+d" (tcp_ptr__)					\
+	: [offzero] "i" (lc_percpu),					\
+	  [offalt] "i" (lc_percpu + LOWCORE_ALT_ADDRESS),		\
+	  "m" (((struct lowcore *)0)->percpu_offset)			\
+	: "cc");							\
+	(TYPEOF_UNQUAL(*(_ptr)) __force __kernel *)tcp_ptr__;		\
+})
+
 /*
  * We use a compare-and-swap loop since that uses less cpu cycles than
  * disabling and enabling interrupts like the generic variant would do.
-- 
2.51.0



* [PATCH v2 2/9] s390/alternatives: Add new ALT_TYPE_PERCPU type
  2026-03-19 12:04 [PATCH v2 0/9] s390: Improve this_cpu operations Heiko Carstens
  2026-03-19 12:04 ` [PATCH v2 1/9] s390/percpu: Provide arch_raw_cpu_ptr() Heiko Carstens
@ 2026-03-19 12:04 ` Heiko Carstens
  2026-03-19 12:04 ` [PATCH v2 3/9] s390/percpu: Infrastructure for more efficient this_cpu operations Heiko Carstens
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Heiko Carstens @ 2026-03-19 12:04 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, linux-kernel, linux-s390

The upcoming percpu section code uses two mviy instructions to guard the
beginning and end of a percpu code section.

The first mviy instruction writes the number of the register which contains
the percpu address to lowcore. This indicates both the beginning of a percpu
code section and which register contains the percpu address.

At compile time the mviy instruction is generated in a way that its base
register is the percpu register and the immediate field is zero. This needs
to be patched so that the base register is zero and the immediate field
contains the register number. For example

  101424:       eb 00 23 c0 00 52       mviy    960(%r2),0

needs to be patched to

  101424:       eb 02 03 c0 00 52       mviy    960(%r0),2

Provide a new ALT_TYPE_PERCPU alternative type which handles this specific
instruction patching. In addition it also handles the relocated lowcore
case, where the displacement of the mviy instruction has a different value.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/boot/alternative.c        |  7 +++++++
 arch/s390/include/asm/alternative.h |  5 +++++
 arch/s390/kernel/alternative.c      | 25 +++++++++++++++++++++++--
 3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/arch/s390/boot/alternative.c b/arch/s390/boot/alternative.c
index 19ea7934b918..ad078a2b1192 100644
--- a/arch/s390/boot/alternative.c
+++ b/arch/s390/boot/alternative.c
@@ -22,6 +22,9 @@ static void alt_debug_all(int type)
 	case ALT_TYPE_SPEC:
 		alt_debug.spec = 1;
 		break;
+	case ALT_TYPE_PERCPU:
+		alt_debug.percpu = 1;
+		break;
 	}
 }
 
@@ -115,6 +118,7 @@ void alt_debug_setup(char *str)
 		alt_debug_all(ALT_TYPE_FACILITY);
 		alt_debug_all(ALT_TYPE_FEATURE);
 		alt_debug_all(ALT_TYPE_SPEC);
+		alt_debug_all(ALT_TYPE_PERCPU);
 		return;
 	}
 	while (*str) {
@@ -130,6 +134,9 @@ void alt_debug_setup(char *str)
 		case ALT_TYPE_SPEC:
 			alt_debug_all(ALT_TYPE_SPEC);
 			break;
+		case ALT_TYPE_PERCPU:
+			alt_debug_all(ALT_TYPE_PERCPU);
+			break;
 		}
 		if (*str != ';')
 			break;
diff --git a/arch/s390/include/asm/alternative.h b/arch/s390/include/asm/alternative.h
index 1c56480def9e..9ca2e49338a2 100644
--- a/arch/s390/include/asm/alternative.h
+++ b/arch/s390/include/asm/alternative.h
@@ -34,6 +34,7 @@
 #define ALT_TYPE_FACILITY	0
 #define ALT_TYPE_FEATURE	1
 #define ALT_TYPE_SPEC		2
+#define ALT_TYPE_PERCPU		3
 
 #define ALT_DATA_SHIFT		0
 #define ALT_TYPE_SHIFT		20
@@ -51,6 +52,10 @@
 					 ALT_TYPE_SPEC << ALT_TYPE_SHIFT	| \
 					 (facility) << ALT_DATA_SHIFT)
 
+#define ALT_PERCPU(num)			(ALT_CTX_EARLY << ALT_CTX_SHIFT		| \
+					 ALT_TYPE_PERCPU << ALT_TYPE_SHIFT	| \
+					 (num) << ALT_DATA_SHIFT)
+
 #ifndef __ASSEMBLER__
 
 #include <linux/types.h>
diff --git a/arch/s390/kernel/alternative.c b/arch/s390/kernel/alternative.c
index 02d04ae621ba..a79a11879c2f 100644
--- a/arch/s390/kernel/alternative.c
+++ b/arch/s390/kernel/alternative.c
@@ -28,6 +28,7 @@ struct alt_debug {
 	unsigned long facilities[MAX_FACILITY_BIT / BITS_PER_LONG];
 	unsigned long mfeatures[MAX_MFEATURE_BIT / BITS_PER_LONG];
 	int spec;
+	int percpu;
 };
 
 static struct alt_debug __bootdata_preserved(alt_debug);
@@ -48,8 +49,18 @@ static void alternative_dump(u8 *old, u8 *new, unsigned int len, unsigned int ty
 	a_debug("[%d/%3d] %016lx: %s -> %s\n", type, data, kptr, oinsn, ninsn);
 }
 
+struct insn_siy {
+	u64	opc1 : 8;
+	u64	i2   : 8;
+	u64	b1   : 4;
+	u64	dl1  : 12;
+	u64	dh1  : 8;
+	u64	opc2 : 8;
+} __packed;
+
 void __apply_alternatives(struct alt_instr *start, struct alt_instr *end, unsigned int ctx)
 {
+	struct insn_siy insn_siy;
 	struct alt_debug *d;
 	struct alt_instr *a;
 	bool debug, replace;
@@ -63,6 +74,8 @@ void __apply_alternatives(struct alt_instr *start, struct alt_instr *end, unsign
 	for (a = start; a < end; a++) {
 		if (!(a->ctx & ctx))
 			continue;
+		old = (u8 *)&a->instr_offset + a->instr_offset;
+		new = (u8 *)&a->repl_offset + a->repl_offset;
 		switch (a->type) {
 		case ALT_TYPE_FACILITY:
 			replace = test_facility(a->data);
@@ -76,14 +89,22 @@ void __apply_alternatives(struct alt_instr *start, struct alt_instr *end, unsign
 			replace = nobp_enabled();
 			debug = d->spec;
 			break;
+		case ALT_TYPE_PERCPU:
+			replace = true;
+			insn_siy = *(struct insn_siy *)old;
+			if (test_machine_feature(MFEATURE_LOWCORE))
+				insn_siy = *(struct insn_siy *)new;
+			insn_siy.i2 = insn_siy.b1;
+			insn_siy.b1 = 0;
+			new = (u8 *)&insn_siy;
+			debug = d->percpu;
+			break;
 		default:
 			replace = false;
 			debug = false;
 		}
 		if (!replace)
 			continue;
-		old = (u8 *)&a->instr_offset + a->instr_offset;
-		new = (u8 *)&a->repl_offset + a->repl_offset;
 		if (debug)
 			alternative_dump(old, new, a->instrlen, a->type, a->data);
 		s390_kernel_write(old, new, a->instrlen);
-- 
2.51.0



* [PATCH v2 3/9] s390/percpu: Infrastructure for more efficient this_cpu operations
  2026-03-19 12:04 [PATCH v2 0/9] s390: Improve this_cpu operations Heiko Carstens
  2026-03-19 12:04 ` [PATCH v2 1/9] s390/percpu: Provide arch_raw_cpu_ptr() Heiko Carstens
  2026-03-19 12:04 ` [PATCH v2 2/9] s390/alternatives: Add new ALT_TYPE_PERCPU type Heiko Carstens
@ 2026-03-19 12:04 ` Heiko Carstens
  2026-03-19 12:04 ` [PATCH v2 4/9] s390/percpu: Use new percpu code section for arch_this_cpu_add() Heiko Carstens
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Heiko Carstens @ 2026-03-19 12:04 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, linux-kernel, linux-s390

With the intended removal of PREEMPT_NONE, this_cpu operations based on
atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
become more expensive: the preempt_disable()/preempt_enable() pairs are
no longer optimized away at compile time.

In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.

E.g. this simple C code sequence

DEFINE_PER_CPU(long, foo);
long bar(long a) { return this_cpu_add_return(foo, a); }

generates this code:

  11a976:       eb af f0 68 00 24       stmg    %r10,%r15,104(%r15)
  11a97c:       b9 04 00 ef             lgr     %r14,%r15
  11a980:       b9 04 00 b2             lgr     %r11,%r2
  11a984:       e3 f0 ff c8 ff 71       lay     %r15,-56(%r15)
  11a98a:       e3 e0 f0 98 00 24       stg     %r14,152(%r15)
  11a990:       eb 01 03 a8 00 6a       asi     936,1            <- __preempt_count_add(1)
  11a996:       c0 10 00 d2 ac b5       larl    %r1,1b70300      <- address of percpu var
  11a9a0:       e3 10 23 b8 00 08       ag      %r1,952          <- add percpu offset
  11a9a6:       eb ab 10 00 00 e8       laag    %r10,%r11,0(%r1) <- atomic op
  11a9ac:       eb ff 03 a8 00 6e       alsi    936,-1           <- __preempt_count_dec_and_test()
  11a9b2:       a7 54 00 05             jnhe    11a9bc <bar+0x4c>
  11a9b6:       c0 e5 00 76 d1 bd       brasl   %r14,ff4d30 <preempt_schedule_notrace>
  11a9bc:       b9 e8 b0 2a             agrk    %r2,%r10,%r11
  11a9c0:       eb af f0 a0 00 04       lmg     %r10,%r15,160(%r15)
  11a9c6:       07 fe                   br      %r14

The above example is more or less the worst case: the branch to
preempt_schedule_notrace() requires a stack frame which otherwise wouldn't
be necessary. But even in simpler cases the conditional jnhe branch
instruction remains.

Get rid of the conditional branch with the following code sequence:

  11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
  11a8ec:       b9 04 00 43             lgr     %r4,%r3
  11a8f0:       eb 00 43 c0 00 52       mviy    960(%r4),0
  11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
  11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
  11a902:       eb 00 03 c0 00 52       mviy    960,0
  11a908:       b9 08 00 25             agr     %r2,%r5
  11a90c:       07 fe                   br      %r14

The general idea is that this_cpu operations based on atomic instructions
are guarded with mviy instructions:

- The first mviy instruction writes the number of the register which
  contains the percpu address to lowcore. This also indicates that a
  percpu code section is being executed.

- The first instruction following the mviy instruction must be the ag
  instruction which adds the percpu offset to the percpu address register.

- Afterwards the atomic percpu operation follows.

- Then a second mviy instruction writes a zero to lowcore, which indicates
  the end of the percpu code section.

- In case of an interrupt/exception/nmi the register number which was
  written to lowcore is copied to the exception frame (pt_regs), and a zero
  is written to lowcore.

- On return to the previous context it is checked whether a percpu code
  section was executed (saved register number not zero), and whether the
  process was migrated to a different cpu. If the percpu offset was
  already added to the percpu address register (instruction address does
  _not_ point to the ag instruction) the content of the percpu address
  register is adjusted so it points to the percpu variable of the new cpu.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/entry-percpu.h | 57 ++++++++++++++++++++++++++++
 arch/s390/include/asm/lowcore.h      |  3 +-
 arch/s390/include/asm/percpu.h       | 54 ++++++++++++++++++++++++++
 arch/s390/include/asm/ptrace.h       |  2 +
 arch/s390/kernel/irq.c               | 17 +++++++--
 arch/s390/kernel/nmi.c               |  3 ++
 arch/s390/kernel/traps.c             |  3 ++
 7 files changed, 134 insertions(+), 5 deletions(-)
 create mode 100644 arch/s390/include/asm/entry-percpu.h

diff --git a/arch/s390/include/asm/entry-percpu.h b/arch/s390/include/asm/entry-percpu.h
new file mode 100644
index 000000000000..7ccfefdd8162
--- /dev/null
+++ b/arch/s390/include/asm/entry-percpu.h
@@ -0,0 +1,57 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ARCH_S390_ENTRY_PERCPU_H
+#define ARCH_S390_ENTRY_PERCPU_H
+
+#include <linux/percpu.h>
+#include <asm/lowcore.h>
+#include <asm/ptrace.h>
+#include <asm/asm-offsets.h>
+
+static __always_inline void percpu_entry(struct pt_regs *regs)
+{
+	struct lowcore *lc = get_lowcore();
+
+	if (user_mode(regs))
+		return;
+	regs->cpu = lc->cpu_nr;
+	regs->percpu_register = lc->percpu_register;
+	lc->percpu_register = 0;
+}
+
+static __always_inline void percpu_exit(struct pt_regs *regs)
+{
+	struct lowcore *lc = get_lowcore();
+	unsigned int insn, disp;
+	unsigned char reg;
+
+	if (user_mode(regs))
+		return;
+	reg = regs->percpu_register;
+	if (!reg)
+		return;
+	lc->percpu_register = reg;
+	if (regs->cpu == lc->cpu_nr)
+		return;
+	/*
+	 * Within a percpu code section, and the process has been migrated
+	 * a different CPU. Check if the percpu base register needs to be
+	 * updated. This is the case if the PSW does not point to the ADD
+	 * instruction within the section
+	 * - AG %rx,percpu_offset_in_lowcore(%r0,%r0)
+	 * which adds the percpu offset to the percpu base register.
+	 */
+	if ((*(u16 *)psw_bits(regs->psw).ia & 0xff0f) != 0xe300)
+		goto fixup;
+	disp = offsetof(struct lowcore, percpu_offset);
+	if (machine_has_relocated_lowcore())
+		disp += LOWCORE_ALT_ADDRESS;
+	insn = (disp & 0xff000) >> 4 | (disp & 0x00fff) << 16 | 0x8;
+	if (*(u32 *)(psw_bits(regs->psw).ia + 2) != insn)
+		return;
+fixup:
+	/* Fixup percpu base register */
+	regs->gprs[reg] -= __per_cpu_offset[regs->cpu];
+	regs->gprs[reg] += lc->percpu_offset;
+}
+
+#endif
diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
index 50ffe75adeb4..cd1ddfdb5d35 100644
--- a/arch/s390/include/asm/lowcore.h
+++ b/arch/s390/include/asm/lowcore.h
@@ -165,7 +165,8 @@ struct lowcore {
 	__u32	spinlock_index;			/* 0x03b0 */
 	__u8	pad_0x03b4[0x03b8-0x03b4];	/* 0x03b4 */
 	__u64	percpu_offset;			/* 0x03b8 */
-	__u8	pad_0x03c0[0x0400-0x03c0];	/* 0x03c0 */
+	__u8	percpu_register;		/* 0x03c0 */
+	__u8	pad_0x03c1[0x0400-0x03c1];	/* 0x03c1 */
 
 	__u32	return_lpswe;			/* 0x0400 */
 	__u32	return_mcck_lpswe;		/* 0x0404 */
diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index b18a96f3a334..05eb91428b42 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -60,6 +60,60 @@
 #define this_cpu_or_1(pcp, val)		arch_this_cpu_to_op_simple(pcp, val, |)
 #define this_cpu_or_2(pcp, val)		arch_this_cpu_to_op_simple(pcp, val, |)
 
+/*
+ * Macros to be used for percpu code section based on atomic instructions.
+ *
+ * Avoid the need to use preempt_disable() / preempt_enable() pairs and the
+ * conditional preempt_schedule_notrace() function calls which come with
+ * this. The idea is that this_cpu operations based on atomic instructions are
+ * guarded with mviy instructions:
+ *
+ * - The first mviy instruction writes the number of the register which
+ *   contains the percpu address to lowcore. This also indicates that a
+ *   percpu code section is being executed.
+ *
+ * - The first instruction following the mviy instruction must be the ag
+ *   instruction which adds the percpu offset to the percpu address register.
+ *
+ * - Afterwards the atomic percpu operation follows.
+ *
+ * - Then a second mviy instruction writes a zero to lowcore, which indicates
+ *   the end of the percpu code section.
+ *
+ * - In case of an interrupt/exception/nmi the register number which was
+ *   written to lowcore is copied to the exception frame (pt_regs), and a zero
+ *   is written to lowcore.
+ *
+ * - On return to the previous context it is checked whether a percpu code
+ *   section was executed (saved register number not zero), and whether the
+ *   process was migrated to a different cpu. If the percpu offset was
+ *   already added to the percpu address register (instruction address does
+ *   _not_ point to the ag instruction) the content of the percpu address
+ *   register is adjusted so it points to the percpu variable of the new cpu.
+ *
+ * Inline assemblies making use of this typically have a code sequence like:
+ *
+ *   MVIY_PERCPU(...) <- start of percpu code section
+ *   AG_ALT(...)      <- add percpu offset; must be the second instruction
+ *   atomic_op	      <- atomic op
+ *   MVIY_ALT(...)    <- end of percpu code section
+ */
+
+#define MVIY_PERCPU(disp, dispalt, base)				\
+	ALTERNATIVE("	mviy	" disp	  "(" base " ),0\n",		\
+		    "	mviy	" dispalt "(" base " ),0\n",		\
+		    ALT_PERCPU(0))
+
+#define MVIY_ALT(disp, dispalt, base)					\
+	ALTERNATIVE("	mviy	" disp	  "(" base " ),0\n",		\
+		    "	mviy	" dispalt "(" base " ),0\n",		\
+		    ALT_FEATURE(MFEATURE_LOWCORE))
+
+#define AG_ALT(disp, dispalt, reg)					\
+	ALTERNATIVE("	ag	" reg ", " disp    "(%%r0)\n",		\
+		    "	ag	" reg ", " dispalt "(%%r0)\n",		\
+		    ALT_FEATURE(MFEATURE_LOWCORE))
+
 #ifndef MARCH_HAS_Z196_FEATURES
 
 #define this_cpu_add_4(pcp, val)	arch_this_cpu_to_op_simple(pcp, val, +)
diff --git a/arch/s390/include/asm/ptrace.h b/arch/s390/include/asm/ptrace.h
index aaceb1d9110a..495e310c3d6d 100644
--- a/arch/s390/include/asm/ptrace.h
+++ b/arch/s390/include/asm/ptrace.h
@@ -134,6 +134,8 @@ struct pt_regs {
 	};
 	unsigned long flags;
 	unsigned long last_break;
+	unsigned int cpu;
+	unsigned char percpu_register;
 };
 
 /*
diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
index d10a17e6531d..43c92b8faade 100644
--- a/arch/s390/kernel/irq.c
+++ b/arch/s390/kernel/irq.c
@@ -33,6 +33,7 @@
 #include <asm/softirq_stack.h>
 #include <asm/vtime.h>
 #include <asm/asm.h>
+#include <asm/entry-percpu.h>
 #include "entry.h"
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_stat, irq_stat);
@@ -142,10 +143,13 @@ static int irq_pending(struct pt_regs *regs)
 
 void noinstr do_io_irq(struct pt_regs *regs)
 {
-	irqentry_state_t state = irqentry_enter(regs);
-	struct pt_regs *old_regs = set_irq_regs(regs);
+	struct pt_regs *old_regs;
+	irqentry_state_t state;
 	bool from_idle;
 
+	percpu_entry(regs);
+	state = irqentry_enter(regs);
+	old_regs = set_irq_regs(regs);
 	from_idle = test_and_clear_cpu_flag(CIF_ENABLED_WAIT);
 	if (from_idle)
 		update_timer_idle();
@@ -177,14 +181,18 @@ void noinstr do_io_irq(struct pt_regs *regs)
 
 	if (from_idle)
 		regs->psw.mask &= ~(PSW_MASK_EXT | PSW_MASK_IO | PSW_MASK_WAIT);
+	percpu_exit(regs);
 }
 
 void noinstr do_ext_irq(struct pt_regs *regs)
 {
-	irqentry_state_t state = irqentry_enter(regs);
-	struct pt_regs *old_regs = set_irq_regs(regs);
+	struct pt_regs *old_regs;
+	irqentry_state_t state;
 	bool from_idle;
 
+	percpu_entry(regs);
+	state = irqentry_enter(regs);
+	old_regs = set_irq_regs(regs);
 	from_idle = test_and_clear_cpu_flag(CIF_ENABLED_WAIT);
 	if (from_idle)
 		update_timer_idle();
@@ -212,6 +220,7 @@ void noinstr do_ext_irq(struct pt_regs *regs)
 
 	if (from_idle)
 		regs->psw.mask &= ~(PSW_MASK_EXT | PSW_MASK_IO | PSW_MASK_WAIT);
+	percpu_exit(regs);
 }
 
 static void show_msi_interrupt(struct seq_file *p, int irq)
diff --git a/arch/s390/kernel/nmi.c b/arch/s390/kernel/nmi.c
index a55abbf65333..20fd319a3a8e 100644
--- a/arch/s390/kernel/nmi.c
+++ b/arch/s390/kernel/nmi.c
@@ -22,6 +22,7 @@
 #include <linux/module.h>
 #include <linux/sched/signal.h>
 #include <linux/kvm_host.h>
+#include <asm/entry-percpu.h>
 #include <asm/lowcore.h>
 #include <asm/ctlreg.h>
 #include <asm/fpu.h>
@@ -374,6 +375,7 @@ void notrace s390_do_machine_check(struct pt_regs *regs)
 	unsigned long mcck_dam_code;
 	int mcck_pending = 0;
 
+	percpu_entry(regs);
 	irq_state = irqentry_nmi_enter(regs);
 
 	if (user_mode(regs))
@@ -496,6 +498,7 @@ void notrace s390_do_machine_check(struct pt_regs *regs)
 		schedule_mcck_handler();
 
 	irqentry_nmi_exit(regs, irq_state);
+	percpu_exit(regs);
 }
 NOKPROBE_SYMBOL(s390_do_machine_check);
 
diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c
index 1b5c6fc431cc..095937c2ca72 100644
--- a/arch/s390/kernel/traps.c
+++ b/arch/s390/kernel/traps.c
@@ -24,6 +24,7 @@
 #include <linux/entry-common.h>
 #include <linux/kmsan.h>
 #include <linux/bug.h>
+#include <asm/entry-percpu.h>
 #include <asm/asm-extable.h>
 #include <asm/irqflags.h>
 #include <asm/ptrace.h>
@@ -349,6 +350,7 @@ void noinstr __do_pgm_check(struct pt_regs *regs)
 		current->thread.gmap_int_code = regs->int_code & 0xffff;
 		return;
 	}
+	percpu_entry(regs);
 	state = irqentry_enter(regs);
 	if (user_mode(regs)) {
 		update_timer_sys();
@@ -386,6 +388,7 @@ void noinstr __do_pgm_check(struct pt_regs *regs)
 out:
 	local_irq_disable();
 	irqentry_exit(regs, state);
+	percpu_exit(regs);
 }
 
 /*
-- 
2.51.0



* [PATCH v2 4/9] s390/percpu: Use new percpu code section for arch_this_cpu_add()
  2026-03-19 12:04 [PATCH v2 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (2 preceding siblings ...)
  2026-03-19 12:04 ` [PATCH v2 3/9] s390/percpu: Infrastructure for more efficient this_cpu operations Heiko Carstens
@ 2026-03-19 12:04 ` Heiko Carstens
  2026-03-19 12:04 ` [PATCH v2 5/9] s390/percpu: Use new percpu code section for arch_this_cpu_add_return() Heiko Carstens
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Heiko Carstens @ 2026-03-19 12:04 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, linux-kernel, linux-s390

Convert arch_this_cpu_add() to make use of the new percpu code section
infrastructure.

With this the text size of the kernel image is reduced by ~76kb
(defconfig). Also more than 5300 generated preempt_schedule_notrace()
function calls within the kernel image (modules not counted) are removed.

With:

DEFINE_PER_CPU(long, foo);
void bar(long a) { this_cpu_add(foo, a); }

Old arch_this_cpu_add() looks like this:

00000000000000c0 <bar>:
  c0:   c0 04 00 00 00 00       jgnop   c0 <bar>
  c6:   eb 01 03 a8 00 6a       asi     936,1
  cc:   c4 18 00 00 00 00       lgrl    %r1,cc <bar+0xc>
                        ce: R_390_GOTENT        foo+0x2
  d2:   e3 10 03 b8 00 08       ag      %r1,952
  d8:   eb 22 10 00 00 e8       laag    %r2,%r2,0(%r1)
  de:   eb ff 03 a8 00 6e       alsi    936,-1
  e4:   a7 a4 00 05             jhe     ee <bar+0x2e>
  e8:   c0 f4 00 00 00 00       jg      e8 <bar+0x28>
                        ea: R_390_PC32DBL       __s390_indirect_jump_r14+0x2
  ee:   c0 f4 00 00 00 00       jg      ee <bar+0x2e>
                        f0: R_390_PLT32DBL      preempt_schedule_notrace+0x2

New arch_this_cpu_add() looks like this:

00000000000000c0 <bar>:
  c0:   c0 04 00 00 00 00       jgnop   c0 <bar>
  c6:   c4 38 00 00 00 00       lgrl    %r3,c6 <bar+0x6>
                        c8: R_390_GOTENT        foo+0x2
  cc:   b9 04 00 43             lgr     %r4,%r3
  d0:   eb 00 43 c0 00 52       mviy    960(%r4),0
  d6:   e3 40 03 b8 00 08       ag      %r4,952
  dc:   eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
  e2:   eb 00 03 c0 00 52       mviy    960,0
  e8:   c0 f4 00 00 00 00       jg      e8 <bar+0x28>
                        ea: R_390_PC32DBL       __s390_indirect_jump_r14+0x2

Note that the conditional function call is removed.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 65 ++++++++++++++++++++++------------
 1 file changed, 43 insertions(+), 22 deletions(-)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 05eb91428b42..3c5364475e3e 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -127,28 +127,49 @@
 
 #else /* MARCH_HAS_Z196_FEATURES */
 
-#define arch_this_cpu_add(pcp, val, op1, op2, szcast)			\
-{									\
-	typedef typeof(pcp) pcp_op_T__; 				\
-	pcp_op_T__ val__ = (val);					\
-	pcp_op_T__ old__, *ptr__;					\
-	preempt_disable_notrace();					\
-	ptr__ = raw_cpu_ptr(&(pcp)); 				\
-	if (__builtin_constant_p(val__) &&				\
-	    ((szcast)val__ > -129) && ((szcast)val__ < 128)) {		\
-		asm volatile(						\
-			op2 "   %[ptr__],%[val__]"			\
-			: [ptr__] "+Q" (*ptr__) 			\
-			: [val__] "i" ((szcast)val__)			\
-			: "cc");					\
-	} else {							\
-		asm volatile(						\
-			op1 "   %[old__],%[val__],%[ptr__]"		\
-			: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)	\
-			: [val__] "d" (val__)				\
-			: "cc");					\
-	}								\
-	preempt_enable_notrace();					\
+#define arch_this_cpu_add(pcp, val, op1, op2, szcast)				\
+{										\
+	unsigned long lc_pcpr, lc_pcpo;						\
+	typedef typeof(pcp) pcp_op_T__;						\
+	pcp_op_T__ val__ = (val);						\
+	pcp_op_T__ old__, *ptr__;						\
+										\
+	lc_pcpr = offsetof(struct lowcore, percpu_register);			\
+	lc_pcpo = offsetof(struct lowcore, percpu_offset);			\
+	ptr__ = PERCPU_PTR(&(pcp));						\
+	if (__builtin_constant_p(val__) &&					\
+	    ((szcast)val__ > -129) && ((szcast)val__ < 128)) {			\
+		asm volatile(							\
+			MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+			AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]")	\
+			op2 "   0(%[ptr__]),%[val__]\n"				\
+			MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]", "%%r0")	\
+			: [ptr__] "+&a" (ptr__), "+m" (*ptr__),			\
+			  "=m" (((struct lowcore *)0)->percpu_register)		\
+			: [val__] "i" ((szcast)val__),				\
+			  [disppcpr] "i" (lc_pcpr),				\
+			  [disppcpo] "i" (lc_pcpo),				\
+			  [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS),	\
+			  [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS),	\
+			  "m" (((struct lowcore *)0)->percpu_offset)		\
+			: "cc");						\
+	} else {								\
+		asm volatile(							\
+			MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+			AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]")	\
+			op1 "   %[old__],%[val__],0(%[ptr__])\n"		\
+			MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]", "%%r0")	\
+			: [old__] "=&d" (old__),				\
+			  [ptr__] "+&a" (ptr__),  "+m" (*ptr__),		\
+			  "=m" (((struct lowcore *)0)->percpu_register)		\
+			: [val__] "d" (val__),					\
+			  [disppcpr] "i" (lc_pcpr),				\
+			  [disppcpo] "i" (lc_pcpo),				\
+			  [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS),	\
+			  [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS),	\
+			  "m" (((struct lowcore *)0)->percpu_offset)		\
+			: "cc");						\
+	}									\
 }
 
 #define this_cpu_add_4(pcp, val) arch_this_cpu_add(pcp, val, "laa", "asi", int)
-- 
2.51.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH v2 5/9] s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
  2026-03-19 12:04 [PATCH v2 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (3 preceding siblings ...)
  2026-03-19 12:04 ` [PATCH v2 4/9] s390/percpu: Use new percpu code section for arch_this_cpu_add() Heiko Carstens
@ 2026-03-19 12:04 ` Heiko Carstens
  2026-03-19 12:05 ` [PATCH v2 6/9] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]() Heiko Carstens
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Heiko Carstens @ 2026-03-19 12:04 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, linux-kernel, linux-s390

Convert arch_this_cpu_add_return() to make use of the new percpu code
section infrastructure.

With this the text size of the kernel image is reduced by ~4k
(defconfig). Also 66 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 3c5364475e3e..2479b4748510 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -177,17 +177,29 @@
 
 #define arch_this_cpu_add_return(pcp, val, op)				\
 ({									\
+	unsigned long lc_pcpr, lc_pcpo;					\
 	typedef typeof(pcp) pcp_op_T__; 				\
 	pcp_op_T__ val__ = (val);					\
 	pcp_op_T__ old__, *ptr__;					\
-	preempt_disable_notrace();					\
-	ptr__ = raw_cpu_ptr(&(pcp));	 				\
-	asm volatile(							\
-		op "    %[old__],%[val__],%[ptr__]"			\
-		: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)		\
-		: [val__] "d" (val__)					\
+									\
+	lc_pcpr = offsetof(struct lowcore, percpu_register);		\
+	lc_pcpo = offsetof(struct lowcore, percpu_offset);		\
+	ptr__ = PERCPU_PTR(&(pcp));					\
+	asm_inline volatile(						\
+		MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+		AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]")	\
+		op "	%[old__],%[val__],0(%[ptr__])\n"		\
+		MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]", "%%r0")	\
+		: [old__] "=&d" (old__),				\
+		  [ptr__] "+&a" (ptr__), "+m" (*ptr__),			\
+		  "=m" (((struct lowcore *)0)->percpu_register)		\
+		: [val__] "d" (val__),					\
+		  [disppcpr] "i" (lc_pcpr),				\
+		  [disppcpo] "i" (lc_pcpo),				\
+		  [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS),	\
+		  [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS),	\
+		  "m" (((struct lowcore *)0)->percpu_offset)		\
 		: "cc");						\
-	preempt_enable_notrace();						\
 	old__ + val__;							\
 })
 
-- 
2.51.0


* [PATCH v2 6/9] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
  2026-03-19 12:04 [PATCH v2 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (4 preceding siblings ...)
  2026-03-19 12:04 ` [PATCH v2 5/9] s390/percpu: Use new percpu code section for arch_this_cpu_add_return() Heiko Carstens
@ 2026-03-19 12:05 ` Heiko Carstens
  2026-03-19 12:05 ` [PATCH v2 7/9] s390/percpu: Provide arch_this_cpu_read() implementation Heiko Carstens
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Heiko Carstens @ 2026-03-19 12:05 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, linux-kernel, linux-s390

Convert arch_this_cpu_[and|or]() to make use of the new percpu code
section infrastructure.

There is no user of this_cpu_and() and only one user of this_cpu_or()
within the kernel. Therefore this conversion has hardly any effect and
removes only one preempt_schedule_notrace() function call.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 26 +++++++++++++++++++-------
 1 file changed, 19 insertions(+), 7 deletions(-)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 2479b4748510..510d9ce1ee47 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -208,17 +208,29 @@
 
 #define arch_this_cpu_to_op(pcp, val, op)				\
 {									\
+	unsigned long lc_pcpr, lc_pcpo;					\
 	typedef typeof(pcp) pcp_op_T__; 				\
 	pcp_op_T__ val__ = (val);					\
 	pcp_op_T__ old__, *ptr__;					\
-	preempt_disable_notrace();					\
-	ptr__ = raw_cpu_ptr(&(pcp));	 				\
-	asm volatile(							\
-		op "    %[old__],%[val__],%[ptr__]"			\
-		: [old__] "=d" (old__), [ptr__] "+Q" (*ptr__)		\
-		: [val__] "d" (val__)					\
+									\
+	lc_pcpr = offsetof(struct lowcore, percpu_register);		\
+	lc_pcpo = offsetof(struct lowcore, percpu_offset);		\
+	ptr__ = PERCPU_PTR(&(pcp));					\
+	asm_inline volatile(						\
+		MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+		AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]")	\
+		op "    %[old__],%[val__],0(%[ptr__])\n"		\
+		MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]", "%%r0")	\
+		: [old__] "=&d" (old__),				\
+		  [ptr__] "+&a" (ptr__), "+m" (*ptr__),			\
+		  "=m" (((struct lowcore *)0)->percpu_register)		\
+		: [val__] "d" (val__),					\
+		  [disppcpr] "i" (lc_pcpr),				\
+		  [disppcpo] "i" (lc_pcpo),				\
+		  [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS),	\
+		  [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS),	\
+		  "m" (((struct lowcore *)0)->percpu_offset)		\
 		: "cc");						\
-	preempt_enable_notrace();					\
 }
 
 #define this_cpu_and_4(pcp, val)	arch_this_cpu_to_op(pcp, val, "lan")
-- 
2.51.0


* [PATCH v2 7/9] s390/percpu: Provide arch_this_cpu_read() implementation
  2026-03-19 12:04 [PATCH v2 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (5 preceding siblings ...)
  2026-03-19 12:05 ` [PATCH v2 6/9] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]() Heiko Carstens
@ 2026-03-19 12:05 ` Heiko Carstens
  2026-03-19 12:05 ` [PATCH v2 8/9] s390/percpu: Provide arch_this_cpu_write() implementation Heiko Carstens
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 13+ messages in thread
From: Heiko Carstens @ 2026-03-19 12:05 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, linux-kernel, linux-s390

Provide an s390 specific implementation of arch_this_cpu_read() instead
of the generic variant. The generic variant uses a preempt_disable() /
preempt_enable() pair and READ_ONCE().

Get rid of the preempt_disable() / preempt_enable() pairs by providing
our own variant, which makes use of the new percpu code section
infrastructure.

With this the text size of the kernel image is reduced by ~1k
(defconfig). Also 87 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 510d9ce1ee47..08c48fa97381 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -240,6 +240,37 @@
 
 #endif /* MARCH_HAS_Z196_FEATURES */
 
+#define arch_this_cpu_read(pcp, op)					\
+({									\
+	unsigned long lc_pcpr, lc_pcpo;					\
+	typedef typeof(pcp) pcp_op_T__;					\
+	pcp_op_T__ val__, *ptr__;					\
+									\
+	lc_pcpr = offsetof(struct lowcore, percpu_register);		\
+	lc_pcpo = offsetof(struct lowcore, percpu_offset);		\
+	ptr__ = PERCPU_PTR(&(pcp));					\
+	asm_inline volatile(						\
+		MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+		AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]")	\
+		op "	%[val__],0(%[ptr__])\n"				\
+		MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]", "%%r0")	\
+		: [val__] "=&d" (val__), [ptr__] "+&a" (ptr__),		\
+		  "=m" (((struct lowcore *)0)->percpu_register)		\
+		: [disppcpr] "i" (lc_pcpr),				\
+		  [disppcpo] "i" (lc_pcpo),				\
+		  [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS),	\
+		  [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS),	\
+		  "m" (*ptr__),						\
+		  "m" (((struct lowcore *)0)->percpu_offset)		\
+		: "cc");						\
+	val__;								\
+})
+
+#define this_cpu_read_1(pcp) arch_this_cpu_read(pcp, "ic")
+#define this_cpu_read_2(pcp) arch_this_cpu_read(pcp, "lh")
+#define this_cpu_read_4(pcp) arch_this_cpu_read(pcp, "l")
+#define this_cpu_read_8(pcp) arch_this_cpu_read(pcp, "lg")
+
 #define arch_this_cpu_cmpxchg(pcp, oval, nval)				\
 ({									\
 	typedef typeof(pcp) pcp_op_T__;					\
-- 
2.51.0


* [PATCH v2 8/9] s390/percpu: Provide arch_this_cpu_write() implementation
  2026-03-19 12:04 [PATCH v2 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (6 preceding siblings ...)
  2026-03-19 12:05 ` [PATCH v2 7/9] s390/percpu: Provide arch_this_cpu_read() implementation Heiko Carstens
@ 2026-03-19 12:05 ` Heiko Carstens
  2026-03-19 12:05 ` [PATCH v2 9/9] s390/percpu: Remove one and two byte this_cpu operation implementation Heiko Carstens
  2026-03-19 13:56 ` [PATCH v2 0/9] s390: Improve this_cpu operations Peter Zijlstra
  9 siblings, 0 replies; 13+ messages in thread
From: Heiko Carstens @ 2026-03-19 12:05 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, linux-kernel, linux-s390

Provide an s390 specific implementation of arch_this_cpu_write()
instead of the generic variant. The generic variant uses a quite
expensive raw_local_irq_save() / raw_local_irq_restore() pair.

Get rid of this by providing our own variant, which makes use of the
new percpu code section infrastructure.

With this the text size of the kernel image is reduced by ~1k (defconfig).

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 32 ++++++++++++++++++++++++++++++++
 1 file changed, 32 insertions(+)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 08c48fa97381..44501a407e6d 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -271,6 +271,38 @@
 #define this_cpu_read_4(pcp) arch_this_cpu_read(pcp, "l")
 #define this_cpu_read_8(pcp) arch_this_cpu_read(pcp, "lg")
 
+#define arch_this_cpu_write(pcp, val, op)				\
+{									\
+	unsigned long lc_pcpr, lc_pcpo;					\
+	typedef typeof(pcp) pcp_op_T__;					\
+	pcp_op_T__ val__ = (val);					\
+	pcp_op_T__ old__, *ptr__;					\
+									\
+	lc_pcpr = offsetof(struct lowcore, percpu_register);		\
+	lc_pcpo = offsetof(struct lowcore, percpu_offset);		\
+	ptr__ = PERCPU_PTR(&(pcp));					\
+	asm_inline volatile(						\
+		MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+		AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]")	\
+		op "    %[val__],0(%[ptr__])\n"				\
+		MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]", "%%r0")	\
+		: [old__] "=&d" (old__),				\
+		  [ptr__] "+&a" (ptr__), "=m" (*ptr__),			\
+		  "=m" (((struct lowcore *)0)->percpu_register)		\
+		: [val__] "d" (val__),					\
+		  [disppcpr] "i" (lc_pcpr),				\
+		  [disppcpo] "i" (lc_pcpo),				\
+		  [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS),	\
+		  [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS),	\
+		  "m" (((struct lowcore *)0)->percpu_offset)		\
+		: "cc");						\
+}
+
+#define this_cpu_write_1(pcp, val) arch_this_cpu_write(pcp, val, "stc")
+#define this_cpu_write_2(pcp, val) arch_this_cpu_write(pcp, val, "sth")
+#define this_cpu_write_4(pcp, val) arch_this_cpu_write(pcp, val, "st")
+#define this_cpu_write_8(pcp, val) arch_this_cpu_write(pcp, val, "stg")
+
 #define arch_this_cpu_cmpxchg(pcp, oval, nval)				\
 ({									\
 	typedef typeof(pcp) pcp_op_T__;					\
-- 
2.51.0


* [PATCH v2 9/9] s390/percpu: Remove one and two byte this_cpu operation implementation
  2026-03-19 12:04 [PATCH v2 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (7 preceding siblings ...)
  2026-03-19 12:05 ` [PATCH v2 8/9] s390/percpu: Provide arch_this_cpu_write() implementation Heiko Carstens
@ 2026-03-19 12:05 ` Heiko Carstens
  2026-03-19 13:56 ` [PATCH v2 0/9] s390: Improve this_cpu operations Peter Zijlstra
  9 siblings, 0 replies; 13+ messages in thread
From: Heiko Carstens @ 2026-03-19 12:05 UTC (permalink / raw)
  To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ
  Cc: Peter Zijlstra, linux-kernel, linux-s390

There are no one and two byte this_cpu operations within the kernel
(defconfig). However, even if there were, the s390 implementation, which
uses a cmpxchg loop, would generate a very large code sequence due to the
lack of native one and two byte cmpxchg instructions.

Remove the s390 implementation and use the generic implementation.

Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
 arch/s390/include/asm/percpu.h | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 44501a407e6d..b5b61471439b 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -51,15 +51,6 @@
 	new__;								\
 })
 
-#define this_cpu_add_1(pcp, val)	arch_this_cpu_to_op_simple(pcp, val, +)
-#define this_cpu_add_2(pcp, val)	arch_this_cpu_to_op_simple(pcp, val, +)
-#define this_cpu_add_return_1(pcp, val) arch_this_cpu_to_op_simple(pcp, val, +)
-#define this_cpu_add_return_2(pcp, val) arch_this_cpu_to_op_simple(pcp, val, +)
-#define this_cpu_and_1(pcp, val)	arch_this_cpu_to_op_simple(pcp, val, &)
-#define this_cpu_and_2(pcp, val)	arch_this_cpu_to_op_simple(pcp, val, &)
-#define this_cpu_or_1(pcp, val)		arch_this_cpu_to_op_simple(pcp, val, |)
-#define this_cpu_or_2(pcp, val)		arch_this_cpu_to_op_simple(pcp, val, |)
-
 /*
  * Macros to be used for percpu code section based on atomic instructions.
  *
-- 
2.51.0


* Re: [PATCH v2 0/9] s390: Improve this_cpu operations
  2026-03-19 12:04 [PATCH v2 0/9] s390: Improve this_cpu operations Heiko Carstens
                   ` (8 preceding siblings ...)
  2026-03-19 12:05 ` [PATCH v2 9/9] s390/percpu: Remove one and two byte this_cpu operation implementation Heiko Carstens
@ 2026-03-19 13:56 ` Peter Zijlstra
  2026-03-20 11:39   ` Heiko Carstens
  9 siblings, 1 reply; 13+ messages in thread
From: Peter Zijlstra @ 2026-03-19 13:56 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, linux-kernel, linux-s390

On Thu, Mar 19, 2026 at 01:04:54PM +0100, Heiko Carstens wrote:
> v2:
> 
> - Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
>   warnings
> 
> - Add missing __packed attribute to insn structure [Sashiko [2]]
> 
> - Fix inverted if condition [Sashiko [2]]
> 
> - Add missing user_mode() check [Sashiko [2]]
> 
> - Move percpu_entry() call in front of irqentry_enter() call in all
>   entry paths to avoid that potential this_cpu() operations overwrite
>   the not-yet saved percpu code section indicator  [Sashiko [2]]

Would it make sense to add arch hooks to irqentry_{enter,exit}() ?

* Re: [PATCH v2 0/9] s390: Improve this_cpu operations
  2026-03-19 13:56 ` [PATCH v2 0/9] s390: Improve this_cpu operations Peter Zijlstra
@ 2026-03-20 11:39   ` Heiko Carstens
  2026-03-20 11:44     ` Peter Zijlstra
  0 siblings, 1 reply; 13+ messages in thread
From: Heiko Carstens @ 2026-03-20 11:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, linux-kernel, linux-s390

On Thu, Mar 19, 2026 at 02:56:12PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 19, 2026 at 01:04:54PM +0100, Heiko Carstens wrote:
> > v2:
> > 
> > - Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
> >   warnings
> > 
> > - Add missing __packed attribute to insn structure [Sashiko [2]]
> > 
> > - Fix inverted if condition [Sashiko [2]]
> > 
> > - Add missing user_mode() check [Sashiko [2]]
> > 
> > - Move percpu_entry() call in front of irqentry_enter() call in all
> >   entry paths to avoid that potential this_cpu() operations overwrite
> >   the not-yet saved percpu code section indicator  [Sashiko [2]]
> 
> Would it make sense to add arch hooks to irqentry_{enter,exit}() ?

I guess it would make sense to have some architecture hook which
allows running code on all entry/exit paths instead of duplicating the
code n times.

But apparently my code has more bugs (e.g. I didn't consider kprobes,
which would break the instruction comparison). So this has to wait
until I'm back after vacation.

* Re: [PATCH v2 0/9] s390: Improve this_cpu operations
  2026-03-20 11:39   ` Heiko Carstens
@ 2026-03-20 11:44     ` Peter Zijlstra
  0 siblings, 0 replies; 13+ messages in thread
From: Peter Zijlstra @ 2026-03-20 11:44 UTC (permalink / raw)
  To: Heiko Carstens
  Cc: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
	Christian Borntraeger, Juergen Christ, linux-kernel, linux-s390

On Fri, Mar 20, 2026 at 12:39:09PM +0100, Heiko Carstens wrote:
> On Thu, Mar 19, 2026 at 02:56:12PM +0100, Peter Zijlstra wrote:
> > On Thu, Mar 19, 2026 at 01:04:54PM +0100, Heiko Carstens wrote:
> > > v2:
> > > 
> > > - Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
> > >   warnings
> > > 
> > > - Add missing __packed attribute to insn structure [Sashiko [2]]
> > > 
> > > - Fix inverted if condition [Sashiko [2]]
> > > 
> > > - Add missing user_mode() check [Sashiko [2]]
> > > 
> > > - Move percpu_entry() call in front of irqentry_enter() call in all
> > >   entry paths to avoid that potential this_cpu() operations overwrite
> > >   the not-yet saved percpu code section indicator  [Sashiko [2]]
> > 
> > Would it make sense to add arch hooks to irqentry_{enter,exit}() ?
> 
> I guess it would make sense to have some architecture hook which
> allows to run code on all entry/exit paths instead of duplicating the
> code n times.
> 
> But apparently my code seems to have more bugs (e.g. I didn't consider
> kprobes which would make the instruction comparison not work). So this
> has to wait until I'm back after vacation.

My vote is to reject kprobes in these regions :-) Anyway, enjoy the time
off!

end of thread, other threads:[~2026-03-20 11:44 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2026-03-19 12:04 [PATCH v2 0/9] s390: Improve this_cpu operations Heiko Carstens
2026-03-19 12:04 ` [PATCH v2 1/9] s390/percpu: Provide arch_raw_cpu_ptr() Heiko Carstens
2026-03-19 12:04 ` [PATCH v2 2/9] s390/alternatives: Add new ALT_TYPE_PERCPU type Heiko Carstens
2026-03-19 12:04 ` [PATCH v2 3/9] s390/percpu: Infrastructure for more efficient this_cpu operations Heiko Carstens
2026-03-19 12:04 ` [PATCH v2 4/9] s390/percpu: Use new percpu code section for arch_this_cpu_add() Heiko Carstens
2026-03-19 12:04 ` [PATCH v2 5/9] s390/percpu: Use new percpu code section for arch_this_cpu_add_return() Heiko Carstens
2026-03-19 12:05 ` [PATCH v2 6/9] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]() Heiko Carstens
2026-03-19 12:05 ` [PATCH v2 7/9] s390/percpu: Provide arch_this_cpu_read() implementation Heiko Carstens
2026-03-19 12:05 ` [PATCH v2 8/9] s390/percpu: Provide arch_this_cpu_write() implementation Heiko Carstens
2026-03-19 12:05 ` [PATCH v2 9/9] s390/percpu: Remove one and two byte this_cpu operation implementation Heiko Carstens
2026-03-19 13:56 ` [PATCH v2 0/9] s390: Improve this_cpu operations Peter Zijlstra
2026-03-20 11:39   ` Heiko Carstens
2026-03-20 11:44     ` Peter Zijlstra
