* [PATCH v5 0/7] s390: Improve this_cpu operations
@ 2026-05-26 5:56 Heiko Carstens
2026-05-26 5:56 ` [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient " Heiko Carstens
` (6 more replies)
0 siblings, 7 replies; 22+ messages in thread
From: Heiko Carstens @ 2026-05-26 5:56 UTC (permalink / raw)
To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
Christian Borntraeger, Juergen Christ
Cc: linux-kernel, linux-s390
v5:
- Add missing user_mode() check to percpu_exit() [Sashiko [6]]
- Use zero extending instructions for arch_this_cpu_read() [Sashiko [6]]
- Keep 1- and 2-byte implementations since
ARCH_HAS_NMI_SAFE_THIS_CPU_OPS relies on them [Sashiko [6]]
[6] https://sashiko.dev/#/patchset/20260522141257.303617-1-hca%40linux.ibm.com
v4:
- Drop alternatives approach and extract percpu base register number for
mviy at compile time [David Laight]
- Fix logic for percpu code section detection, as well as
interruption/exception/nmi path [Sashiko [5]]
[5] https://sashiko.dev/#/patchset/20260520092243.264847-1-hca%40linux.ibm.com
v3:
- Fix various typos [Juergen Christ]
- Add missing kprobe detection / handling [Sashiko [3]]
[FWIW, this made me also aware of that the current general s390 kprobes
code seems to be racy against concurrent removal of a kprobe while a
probe hit on a different CPU. But that is a different story.]
- Fix various minor findings [Sashiko [3]]
- All of this might be dropped / exchanged in future in favor of the percpu
page table approach proposed by Yang Shi [4].
[3] https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca@linux.ibm.com
[4] https://lore.kernel.org/all/20260429170758.3018959-1-yang@os.amperecomputing.com/
v2:
- Add proper PERCPU_PTR cast to most patches to avoid tons of sparse
warnings
- Add missing __packed attribute to insn structure [Sashiko [2]]
- Fix inverted if condition [Sashiko [2]]
- Add missing user_mode() check [Sashiko [2]]
- Move percpu_entry() call in front of irqentry_enter() call in all
entry paths to avoid that potential this_cpu() operations overwrite
the not-yet saved percpu code section indicator [Sashiko [2]]
[2] https://sashiko.dev/#/patchset/20260317195436.2276810-1-hca%40linux.ibm.com
v1:
This is a follow-up to Peter Zijlstra's in-kernel rseq RFC [1].
With the intended removal of PREEMPT_NONE this_cpu operations based on
atomic instructions, guarded with preempt_disable()/preempt_enable() pairs,
become more expensive: the preempt_disable() / preempt_enable() pairs are
not optimized away anymore during compile time.
In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.
To avoid this Peter suggested an in-kernel rseq approach. While this would
certainly work, this series tries to come up with a solution which uses
less instructions and doesn't require to repeat instruction sequences.
The idea is that this_cpu operations based on atomic instructions are
guarded with mvyi instructions:
- The first mvyi instruction writes the register number, which contains
the percpu address variable to lowcore. This also indicates that a
percpu code section is executed.
- The first instruction following the mvyi instruction must be the ag
instruction which adds the percpu offset to the percpu address register.
- Afterwards the atomic percpu operation follows.
- Then a second mvyi instruction writes a zero to lowcore, which indicates
the end of the percpu code section.
- In case of an interrupt/exception/nmi the register number which was
written to lowcore is copied to the exception frame (pt_regs), and a zero
is written to lowcore.
- On return to the previous context it is checked if a percpu code section
was executed (saved register number not zero), and if the process was
migrated to a different cpu. If the percpu offset was already added to
the percpu address register (instruction address does _not_ point to the
ag instruction) the content of the percpu address register is adjusted so
it points to percpu variable of the new cpu.
All of this seems to work, but of course it could still be broken since I
missed some detail.
In total this series results in a kernel text size reduction of ~106kb. The
number of preempt_schedule_notrace() call sites is reduced from 7089 to
1577.
Note: this comes without any huge performance analysis, however all
microbenchmarks confirmed that the new code is at least as fast as the
old code, like expected.
[1] 20260223163843.GR1282955@noisy.programming.kicks-ass.net
Heiko Carstens (7):
s390/percpu: Infrastructure for more efficient this_cpu operations
s390/percpu: Add missing do { } while (0) constructs
s390/percpu: Use new percpu code section for arch_this_cpu_add()
s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
s390/percpu: Provide arch_this_cpu_read() implementation
s390/percpu: Provide arch_this_cpu_write() implementation
arch/s390/include/asm/entry-percpu.h | 80 +++++++++
arch/s390/include/asm/lowcore.h | 3 +-
arch/s390/include/asm/percpu.h | 246 ++++++++++++++++++++++-----
arch/s390/include/asm/ptrace.h | 2 +
arch/s390/kernel/irq.c | 24 ++-
arch/s390/kernel/nmi.c | 5 +
arch/s390/kernel/traps.c | 5 +
7 files changed, 318 insertions(+), 47 deletions(-)
create mode 100644 arch/s390/include/asm/entry-percpu.h
--
2.51.0
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient this_cpu operations
2026-05-26 5:56 [PATCH v5 0/7] s390: Improve this_cpu operations Heiko Carstens
@ 2026-05-26 5:56 ` Heiko Carstens
2026-06-01 7:10 ` Alexander Gordeev
2026-06-02 10:08 ` Alexander Gordeev
2026-05-26 5:56 ` [PATCH v5 2/7] s390/percpu: Add missing do { } while (0) constructs Heiko Carstens
` (5 subsequent siblings)
6 siblings, 2 replies; 22+ messages in thread
From: Heiko Carstens @ 2026-05-26 5:56 UTC (permalink / raw)
To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
Christian Borntraeger, Juergen Christ
Cc: linux-kernel, linux-s390
With the intended removal of PREEMPT_NONE this_cpu operations based on
atomic instructions, guarded with preempt_disable()/preempt_enable() pairs
become more expensive: the preempt_disable() / preempt_enable() pairs are
not optimized away anymore during compile time.
In particular the conditional call to preempt_schedule_notrace() after
preempt_enable() adds additional code and register pressure.
E.g. this simple C code sequence
DEFINE_PER_CPU(long, foo);
long bar(long a) { return this_cpu_add_return(foo, a); }
generates this code:
11a976: eb af f0 68 00 24 stmg %r10,%r15,104(%r15)
11a97c: b9 04 00 ef lgr %r14,%r15
11a980: b9 04 00 b2 lgr %r11,%r2
11a984: e3 f0 ff c8 ff 71 lay %r15,-56(%r15)
11a98a: e3 e0 f0 98 00 24 stg %r14,152(%r15)
11a990: eb 01 03 a8 00 6a asi 936,1 <- __preempt_count_add(1)
11a996: c0 10 00 d2 ac b5 larl %r1,1b70300 <- address of percpu var
11a9a0: e3 10 23 b8 00 08 ag %r1,952 <- add percpu offset
11a9a6: eb ab 10 00 00 e8 laag %r10,%r11,0(%r1) <- atomic op
11a9ac: eb ff 03 a8 00 6e alsi 936,-1 <- __preempt_count_dec_and_test()
11a9b2: a7 54 00 05 jnhe 11a9bc <bar+0x4c>
11a9b6: c0 e5 00 76 d1 bd brasl %r14,ff4d30 <preempt_schedule_notrace>
11a9bc: b9 e8 b0 2a agrk %r2,%r10,%r11
11a9c0: eb af f0 a0 00 04 lmg %r10,%r15,160(%r15)
11a9c6 07 fe br %r14
Even though the above example is more or less the worst case, since the
branch to preempt_schedule_notrace() requires a stackframe, which
otherwise wouldn't be necessary, there is also the conditional jnhe branch
instruction.
Get rid of the conditional branch with the following code sequence:
11a8e6: c0 30 00 d0 c5 0d larl %r3,1b33300
11a8ec: b9 04 00 43 lgr %r4,%r3
11a8f0: eb 00 43 c0 00 52 mviy 960,4
11a8f6: e3 40 03 b8 00 08 ag %r4,952
11a8fc: eb 52 40 00 00 e8 laag %r5,%r2,0(%r4)
11a902: eb 00 03 c0 00 52 mviy 960,0
11a908: b9 08 00 25 agr %r2,%r5
11a90c 07 fe br %r14
The general idea is that this_cpu operations based on atomic instructions
are guarded with mviy instructions:
- The first mviy instruction writes the register number, which contains
the percpu address variable to lowcore. This also indicates that a
percpu code section is executed.
- The first instruction following the mviy instruction must be the ag
instruction which adds the percpu offset to the percpu address register.
- Afterwards the atomic percpu operation follows.
- Then a second mviy instruction writes a zero to lowcore, which indicates
the end of the percpu code section.
- In case of an interrupt/exception/nmi the register number which was
written to lowcore is copied to the exception frame (pt_regs), and a zero
is written to lowcore.
- On return to the previous context it is checked if a percpu code section
was executed (saved register number not zero), and if the process was
migrated to a different cpu. If the percpu offset was already added to
the percpu address register (instruction address does _not_ point to the
ag instruction) the content of the percpu address register is adjusted so
it points to percpu variable of the new cpu.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
arch/s390/include/asm/entry-percpu.h | 80 ++++++++++++++++++++++++++++
arch/s390/include/asm/lowcore.h | 3 +-
arch/s390/include/asm/percpu.h | 62 +++++++++++++++++++++
arch/s390/include/asm/ptrace.h | 2 +
arch/s390/kernel/irq.c | 24 ++++++---
arch/s390/kernel/nmi.c | 5 ++
arch/s390/kernel/traps.c | 5 ++
7 files changed, 173 insertions(+), 8 deletions(-)
create mode 100644 arch/s390/include/asm/entry-percpu.h
diff --git a/arch/s390/include/asm/entry-percpu.h b/arch/s390/include/asm/entry-percpu.h
new file mode 100644
index 000000000000..4ef47265cfce
--- /dev/null
+++ b/arch/s390/include/asm/entry-percpu.h
@@ -0,0 +1,80 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef ARCH_S390_ENTRY_PERCPU_H
+#define ARCH_S390_ENTRY_PERCPU_H
+
+#include <linux/kprobes.h>
+#include <linux/percpu.h>
+#include <asm/lowcore.h>
+#include <asm/ptrace.h>
+#include <asm/asm-offsets.h>
+
+static __always_inline void percpu_entry(struct pt_regs *regs)
+{
+ struct lowcore *lc = get_lowcore();
+
+ if (user_mode(regs))
+ return;
+ regs->cpu = lc->cpu_nr;
+ regs->percpu_register = lc->percpu_register;
+ lc->percpu_register = 0;
+}
+
+static __always_inline bool percpu_code_check(struct pt_regs *regs)
+{
+ unsigned int insn, disp;
+ struct kprobe *p;
+
+ if (likely(user_mode(regs) || !regs->percpu_register))
+ return false;
+ /*
+ * Within a percpu code section - check if the percpu base register
+ * needs to be updated. This is the case if the PSW does not point to
+ * the ADD instruction within the section.
+ * - AG %rx,percpu_offset_in_lowcore(%r0,%r0)
+ * which adds the percpu offset to the percpu base register.
+ */
+ lockdep_assert_preemption_disabled();
+again:
+ insn = READ_ONCE(*(u16 *)psw_bits(regs->psw).ia);
+ if (unlikely(insn == BREAKPOINT_INSTRUCTION)) {
+ p = get_kprobe((void *)psw_bits(regs->psw).ia);
+ /*
+ * If the kprobe is concurrently removed on a different CPU
+ * it might not be found anymore. However text must have
+ * been restored - try again.
+ */
+ if (!p)
+ goto again;
+ insn = p->opcode;
+ }
+ if ((insn & 0xff0f) != 0xe300)
+ return true;
+ disp = offsetof(struct lowcore, percpu_offset);
+ if (machine_has_relocated_lowcore())
+ disp += LOWCORE_ALT_ADDRESS;
+ insn = (disp & 0xff000) >> 4 | (disp & 0x00fff) << 16 | 0x8;
+ if (*(u32 *)(psw_bits(regs->psw).ia + 2) != insn)
+ return true;
+ return false;
+}
+
+static __always_inline void percpu_exit(struct pt_regs *regs, bool needs_fixup)
+{
+ struct lowcore *lc = get_lowcore();
+ unsigned char reg;
+
+ if (user_mode(regs))
+ return;
+ reg = regs->percpu_register;
+ lc->percpu_register = reg;
+ if (likely(!needs_fixup))
+ return;
+ /* Check if process has been migrated to a different CPU. */
+ if (regs->cpu == lc->cpu_nr)
+ return;
+ /* Fixup percpu base register */
+ regs->gprs[reg] -= __per_cpu_offset[regs->cpu];
+ regs->gprs[reg] += lc->percpu_offset;
+}
+
+#endif
diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
index 50ffe75adeb4..cd1ddfdb5d35 100644
--- a/arch/s390/include/asm/lowcore.h
+++ b/arch/s390/include/asm/lowcore.h
@@ -165,7 +165,8 @@ struct lowcore {
__u32 spinlock_index; /* 0x03b0 */
__u8 pad_0x03b4[0x03b8-0x03b4]; /* 0x03b4 */
__u64 percpu_offset; /* 0x03b8 */
- __u8 pad_0x03c0[0x0400-0x03c0]; /* 0x03c0 */
+ __u8 percpu_register; /* 0x03c0 */
+ __u8 pad_0x03c1[0x0400-0x03c1]; /* 0x03c1 */
__u32 return_lpswe; /* 0x0400 */
__u32 return_mcck_lpswe; /* 0x0404 */
diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index b18a96f3a334..78602d2f5eba 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -60,6 +60,68 @@
#define this_cpu_or_1(pcp, val) arch_this_cpu_to_op_simple(pcp, val, |)
#define this_cpu_or_2(pcp, val) arch_this_cpu_to_op_simple(pcp, val, |)
+/*
+ * Macros to be used for percpu code section based on atomic instructions.
+ *
+ * Avoid the need to use preempt_disable() / preempt_disable() pairs and the
+ * conditional preempt_schedule_notrace() function calls which come with
+ * this. The idea is that this_cpu operations based on atomic instructions are
+ * guarded with mviy instructions:
+ *
+ * - The first mviy instruction writes the register number, which contains the
+ * percpu address variable to lowcore. This also indicates that a percpu
+ * code section is executed.
+ *
+ * - The first mviy instruction following the mviy instruction must be the ag
+ * instruction which adds the percpu offset to the percpu address register.
+ *
+ * - Afterwards the atomic percpu operation follows.
+ *
+ * - Then a second mviy instruction writes a zero to lowcore, which indicates
+ * the end of the percpu code section.
+ *
+ * - In case of an interrupt/exception/nmi the register number which was
+ * written to lowcore is copied to the exception frame (pt_regs), and a zero
+ * is written to lowcore.
+ *
+ * - On return to the previous context it is checked if a percpu code section
+ * was executed (saved register number not zero), and if the process was
+ * migrated to a different cpu. If the percpu offset was already added to
+ * the percpu address register (instruction address does _not_ point to the
+ * ag instruction) the content of the percpu address register is adjusted so
+ * it points to percpu variable of the new cpu.
+ *
+ * Inline assemblies making use of this typically have a code sequence like:
+ *
+ * MVIY_PERCPU(...) <- start of percpu code section
+ * AG_ALT(...) <- add percpu offset; must be the second instruction
+ * atomic_op <- atomic op
+ * MVIY_ALT(...) <- end of percpu code section
+ */
+
+#define MVIY_PERCPU(disp, dispalt, reg) \
+ ".macro GEN_MVIY disp reg\n" \
+ ".irp rs,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15\n" \
+ " .ifc \\reg,%%r\\rs\n" \
+ " mviy \\disp(%%r0),\\rs\n" \
+ " .endif\n" \
+ ".endr\n" \
+ ".endm\n" \
+ ALTERNATIVE("GEN_MVIY " __stringify(disp) " " __stringify(reg) "\n", \
+ "GEN_MVIY " __stringify(dispalt) " " __stringify(reg) "\n", \
+ ALT_FEATURE(MFEATURE_LOWCORE)) \
+ ".purgem GEN_MVIY\n"
+
+#define MVIY_ALT(disp, dispalt) \
+ ALTERNATIVE(" mviy " disp "(%%r0),0\n", \
+ " mviy " dispalt "(%%r0),0\n", \
+ ALT_FEATURE(MFEATURE_LOWCORE))
+
+#define AG_ALT(disp, dispalt, reg) \
+ ALTERNATIVE(" ag " reg ", " disp "(%%r0)\n", \
+ " ag " reg ", " dispalt "(%%r0)\n", \
+ ALT_FEATURE(MFEATURE_LOWCORE))
+
#ifndef MARCH_HAS_Z196_FEATURES
#define this_cpu_add_4(pcp, val) arch_this_cpu_to_op_simple(pcp, val, +)
diff --git a/arch/s390/include/asm/ptrace.h b/arch/s390/include/asm/ptrace.h
index aaceb1d9110a..495e310c3d6d 100644
--- a/arch/s390/include/asm/ptrace.h
+++ b/arch/s390/include/asm/ptrace.h
@@ -134,6 +134,8 @@ struct pt_regs {
};
unsigned long flags;
unsigned long last_break;
+ unsigned int cpu;
+ unsigned char percpu_register;
};
/*
diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
index d10a17e6531d..efb988833c88 100644
--- a/arch/s390/kernel/irq.c
+++ b/arch/s390/kernel/irq.c
@@ -33,6 +33,7 @@
#include <asm/softirq_stack.h>
#include <asm/vtime.h>
#include <asm/asm.h>
+#include <asm/entry-percpu.h>
#include "entry.h"
DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_stat, irq_stat);
@@ -142,10 +143,13 @@ static int irq_pending(struct pt_regs *regs)
void noinstr do_io_irq(struct pt_regs *regs)
{
- irqentry_state_t state = irqentry_enter(regs);
- struct pt_regs *old_regs = set_irq_regs(regs);
- bool from_idle;
+ bool from_idle, percpu_needs_fixup;
+ struct pt_regs *old_regs;
+ irqentry_state_t state;
+ percpu_entry(regs);
+ state = irqentry_enter(regs);
+ old_regs = set_irq_regs(regs);
from_idle = test_and_clear_cpu_flag(CIF_ENABLED_WAIT);
if (from_idle)
update_timer_idle();
@@ -170,21 +174,25 @@ void noinstr do_io_irq(struct pt_regs *regs)
do_irq_async(regs, IO_INTERRUPT);
} while (machine_is_lpar() && irq_pending(regs));
+ percpu_needs_fixup = percpu_code_check(regs);
irq_exit_rcu();
-
set_irq_regs(old_regs);
irqentry_exit(regs, state);
if (from_idle)
regs->psw.mask &= ~(PSW_MASK_EXT | PSW_MASK_IO | PSW_MASK_WAIT);
+ percpu_exit(regs, percpu_needs_fixup);
}
void noinstr do_ext_irq(struct pt_regs *regs)
{
- irqentry_state_t state = irqentry_enter(regs);
- struct pt_regs *old_regs = set_irq_regs(regs);
- bool from_idle;
+ bool from_idle, percpu_needs_fixup;
+ struct pt_regs *old_regs;
+ irqentry_state_t state;
+ percpu_entry(regs);
+ state = irqentry_enter(regs);
+ old_regs = set_irq_regs(regs);
from_idle = test_and_clear_cpu_flag(CIF_ENABLED_WAIT);
if (from_idle)
update_timer_idle();
@@ -206,12 +214,14 @@ void noinstr do_ext_irq(struct pt_regs *regs)
do_irq_async(regs, EXT_INTERRUPT);
+ percpu_needs_fixup = percpu_code_check(regs);
irq_exit_rcu();
set_irq_regs(old_regs);
irqentry_exit(regs, state);
if (from_idle)
regs->psw.mask &= ~(PSW_MASK_EXT | PSW_MASK_IO | PSW_MASK_WAIT);
+ percpu_exit(regs, percpu_needs_fixup);
}
static void show_msi_interrupt(struct seq_file *p, int irq)
diff --git a/arch/s390/kernel/nmi.c b/arch/s390/kernel/nmi.c
index 94fbfad49f62..e17a59d4d5a4 100644
--- a/arch/s390/kernel/nmi.c
+++ b/arch/s390/kernel/nmi.c
@@ -22,6 +22,7 @@
#include <linux/module.h>
#include <linux/sched/signal.h>
#include <linux/kvm_host.h>
+#include <asm/entry-percpu.h>
#include <asm/lowcore.h>
#include <asm/ctlreg.h>
#include <asm/fpu.h>
@@ -363,6 +364,7 @@ NOKPROBE_SYMBOL(s390_backup_mcck_info);
*/
void notrace s390_do_machine_check(struct pt_regs *regs)
{
+ bool percpu_needs_fixup;
static int ipd_count;
static DEFINE_SPINLOCK(ipd_lock);
static unsigned long long last_ipd;
@@ -374,6 +376,7 @@ void notrace s390_do_machine_check(struct pt_regs *regs)
unsigned long mcck_dam_code;
int mcck_pending = 0;
+ percpu_entry(regs);
irq_state = irqentry_nmi_enter(regs);
if (user_mode(regs))
@@ -495,7 +498,9 @@ void notrace s390_do_machine_check(struct pt_regs *regs)
if (mcck_pending)
schedule_mcck_handler();
+ percpu_needs_fixup = percpu_code_check(regs);
irqentry_nmi_exit(regs, irq_state);
+ percpu_exit(regs, percpu_needs_fixup);
}
NOKPROBE_SYMBOL(s390_do_machine_check);
diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c
index 1b5c6fc431cc..564403496a7c 100644
--- a/arch/s390/kernel/traps.c
+++ b/arch/s390/kernel/traps.c
@@ -24,6 +24,7 @@
#include <linux/entry-common.h>
#include <linux/kmsan.h>
#include <linux/bug.h>
+#include <asm/entry-percpu.h>
#include <asm/asm-extable.h>
#include <asm/irqflags.h>
#include <asm/ptrace.h>
@@ -329,6 +330,7 @@ static void (*pgm_check_table[128])(struct pt_regs *regs);
void noinstr __do_pgm_check(struct pt_regs *regs)
{
struct lowcore *lc = get_lowcore();
+ bool percpu_needs_fixup;
irqentry_state_t state;
unsigned int trapnr;
union teid teid;
@@ -349,6 +351,7 @@ void noinstr __do_pgm_check(struct pt_regs *regs)
current->thread.gmap_int_code = regs->int_code & 0xffff;
return;
}
+ percpu_entry(regs);
state = irqentry_enter(regs);
if (user_mode(regs)) {
update_timer_sys();
@@ -385,7 +388,9 @@ void noinstr __do_pgm_check(struct pt_regs *regs)
pgm_check_table[trapnr](regs);
out:
local_irq_disable();
+ percpu_needs_fixup = percpu_code_check(regs);
irqentry_exit(regs, state);
+ percpu_exit(regs, percpu_needs_fixup);
}
/*
--
2.51.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 2/7] s390/percpu: Add missing do { } while (0) constructs
2026-05-26 5:56 [PATCH v5 0/7] s390: Improve this_cpu operations Heiko Carstens
2026-05-26 5:56 ` [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient " Heiko Carstens
@ 2026-05-26 5:56 ` Heiko Carstens
2026-06-01 7:14 ` Alexander Gordeev
2026-05-26 5:56 ` [PATCH v5 3/7] s390/percpu: Use new percpu code section for arch_this_cpu_add() Heiko Carstens
` (4 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Heiko Carstens @ 2026-05-26 5:56 UTC (permalink / raw)
To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
Christian Borntraeger, Juergen Christ
Cc: linux-kernel, linux-s390
Add missing do { } while (0) constructs in order to avoid potential
build failures.
Reported-by: Sashiko <sashiko-bot@kernel.org>
Closes: https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca%40linux.ibm.com
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
arch/s390/include/asm/percpu.h | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 78602d2f5eba..79d5a4460b18 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -136,7 +136,7 @@
#else /* MARCH_HAS_Z196_FEATURES */
#define arch_this_cpu_add(pcp, val, op1, op2, szcast) \
-{ \
+do { \
typedef typeof(pcp) pcp_op_T__; \
pcp_op_T__ val__ = (val); \
pcp_op_T__ old__, *ptr__; \
@@ -157,7 +157,7 @@
: "cc"); \
} \
preempt_enable_notrace(); \
-}
+} while (0)
#define this_cpu_add_4(pcp, val) arch_this_cpu_add(pcp, val, "laa", "asi", int)
#define this_cpu_add_8(pcp, val) arch_this_cpu_add(pcp, val, "laag", "agsi", long)
@@ -182,7 +182,7 @@
#define this_cpu_add_return_8(pcp, val) arch_this_cpu_add_return(pcp, val, "laag")
#define arch_this_cpu_to_op(pcp, val, op) \
-{ \
+do { \
typedef typeof(pcp) pcp_op_T__; \
pcp_op_T__ val__ = (val); \
pcp_op_T__ old__, *ptr__; \
@@ -194,7 +194,7 @@
: [val__] "d" (val__) \
: "cc"); \
preempt_enable_notrace(); \
-}
+} while (0)
#define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan")
#define this_cpu_and_8(pcp, val) arch_this_cpu_to_op(pcp, val, "lang")
--
2.51.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 3/7] s390/percpu: Use new percpu code section for arch_this_cpu_add()
2026-05-26 5:56 [PATCH v5 0/7] s390: Improve this_cpu operations Heiko Carstens
2026-05-26 5:56 ` [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient " Heiko Carstens
2026-05-26 5:56 ` [PATCH v5 2/7] s390/percpu: Add missing do { } while (0) constructs Heiko Carstens
@ 2026-05-26 5:56 ` Heiko Carstens
2026-06-02 11:33 ` Alexander Gordeev
2026-05-26 5:56 ` [PATCH v5 4/7] s390/percpu: Use new percpu code section for arch_this_cpu_add_return() Heiko Carstens
` (3 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Heiko Carstens @ 2026-05-26 5:56 UTC (permalink / raw)
To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
Christian Borntraeger, Juergen Christ
Cc: linux-kernel, linux-s390
Convert arch_this_cpu_add() to make use of the new percpu code section
infrastructure.
With this the text size of the kernel image is reduced by ~76kb
(defconfig). Also more than 5300 generated preempt_schedule_notrace()
function calls within the kernel image (modules not counted) are removed.
With:
DEFINE_PER_CPU(long, foo);
void bar(long a) { this_cpu_add(foo, a); }
Old arch_this_cpu_add() looks like this:
00000000000000c0 <bar>:
c0: c0 04 00 00 00 00 jgnop c0 <bar>
c6: eb 01 03 a8 00 6a asi 936,1
cc: c4 18 00 00 00 00 lgrl %r1,cc <bar+0xc>
ce: R_390_GOTENT foo+0x2
d2: e3 10 03 b8 00 08 ag %r1,952
d8: eb 22 10 00 00 e8 laag %r2,%r2,0(%r1)
de: eb ff 03 a8 00 6e alsi 936,-1
e4: a7 a4 00 05 jhe ee <bar+0x2e>
e8: c0 f4 00 00 00 00 jg e8 <bar+0x28>
ea: R_390_PC32DBL __s390_indirect_jump_r14+0x2
ee: c0 f4 00 00 00 00 jg ee <bar+0x2e>
f0: R_390_PLT32DBL preempt_schedule_notrace+0x2
New arch_this_cpu_add() looks like this:
00000000000000c0 <bar>:
c0: c0 04 00 00 00 00 jgnop c0 <bar>
c6: c4 38 00 00 00 00 lgrl %r3,c6 <bar+0x6>
c8: R_390_GOTENT foo+0x2
cc: b9 04 00 43 lgr %r4,%r3
d0: eb 00 43 c0 00 52 mviy 960(%r0),4
d6: e3 40 03 b8 00 08 ag %r4,952
dc: eb 52 40 00 00 e8 laag %r5,%r2,0(%r4)
e2: eb 00 03 c0 00 52 mviy 960,0
e8: c0 f4 00 00 00 00 jg e8 <bar+0x28>
ea: R_390_PC32DBL __s390_indirect_jump_r14+0x2
Note that the conditional function call is removed.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
arch/s390/include/asm/percpu.h | 65 ++++++++++++++++++++++------------
1 file changed, 43 insertions(+), 22 deletions(-)
diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 79d5a4460b18..9140d81b7efc 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -135,28 +135,49 @@
#else /* MARCH_HAS_Z196_FEATURES */
-#define arch_this_cpu_add(pcp, val, op1, op2, szcast) \
-do { \
- typedef typeof(pcp) pcp_op_T__; \
- pcp_op_T__ val__ = (val); \
- pcp_op_T__ old__, *ptr__; \
- preempt_disable_notrace(); \
- ptr__ = raw_cpu_ptr(&(pcp)); \
- if (__builtin_constant_p(val__) && \
- ((szcast)val__ > -129) && ((szcast)val__ < 128)) { \
- asm volatile( \
- op2 " %[ptr__],%[val__]" \
- : [ptr__] "+Q" (*ptr__) \
- : [val__] "i" ((szcast)val__) \
- : "cc"); \
- } else { \
- asm volatile( \
- op1 " %[old__],%[val__],%[ptr__]" \
- : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \
- : [val__] "d" (val__) \
- : "cc"); \
- } \
- preempt_enable_notrace(); \
+#define arch_this_cpu_add(pcp, val, op1, op2, szcast) \
+do { \
+ unsigned long lc_pcpr, lc_pcpo; \
+ typedef typeof(pcp) pcp_op_T__; \
+ pcp_op_T__ val__ = (val); \
+ pcp_op_T__ old__, *ptr__; \
+ \
+ lc_pcpr = offsetof(struct lowcore, percpu_register); \
+ lc_pcpo = offsetof(struct lowcore, percpu_offset); \
+ ptr__ = PERCPU_PTR(&(pcp)); \
+ if (__builtin_constant_p(val__) && \
+ ((szcast)val__ > -129) && ((szcast)val__ < 128)) { \
+ asm volatile( \
+ MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+ AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]") \
+ op2 " 0(%[ptr__]),%[val__]\n" \
+ MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]") \
+ : [ptr__] "+&a" (ptr__), "+m" (*ptr__), \
+ "=m" (((struct lowcore *)0)->percpu_register) \
+ : [val__] "i" ((szcast)val__), \
+ [disppcpr] "i" (lc_pcpr), \
+ [disppcpo] "i" (lc_pcpo), \
+ [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS), \
+ [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS), \
+ "m" (((struct lowcore *)0)->percpu_offset) \
+ : "cc"); \
+ } else { \
+ asm volatile( \
+ MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+ AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]") \
+ op1 " %[old__],%[val__],0(%[ptr__])\n" \
+ MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]") \
+ : [old__] "=&d" (old__), \
+ [ptr__] "+&a" (ptr__), "+m" (*ptr__), \
+ "=m" (((struct lowcore *)0)->percpu_register) \
+ : [val__] "d" (val__), \
+ [disppcpr] "i" (lc_pcpr), \
+ [disppcpo] "i" (lc_pcpo), \
+ [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS), \
+ [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS), \
+ "m" (((struct lowcore *)0)->percpu_offset) \
+ : "cc"); \
+ } \
} while (0)
#define this_cpu_add_4(pcp, val) arch_this_cpu_add(pcp, val, "laa", "asi", int)
--
2.51.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 4/7] s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
2026-05-26 5:56 [PATCH v5 0/7] s390: Improve this_cpu operations Heiko Carstens
` (2 preceding siblings ...)
2026-05-26 5:56 ` [PATCH v5 3/7] s390/percpu: Use new percpu code section for arch_this_cpu_add() Heiko Carstens
@ 2026-05-26 5:56 ` Heiko Carstens
2026-06-02 11:34 ` Alexander Gordeev
2026-05-26 5:57 ` [PATCH v5 5/7] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]() Heiko Carstens
` (2 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Heiko Carstens @ 2026-05-26 5:56 UTC (permalink / raw)
To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
Christian Borntraeger, Juergen Christ
Cc: linux-kernel, linux-s390
Convert arch_this_cpu_add_return() to make use of the new percpu code
section infrastructure.
With this the text size of the kernel image is reduced by ~4k
(defconfig). Also 66 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
arch/s390/include/asm/percpu.h | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 9140d81b7efc..f2d0e0354582 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -185,17 +185,29 @@ do { \
#define arch_this_cpu_add_return(pcp, val, op) \
({ \
+ unsigned long lc_pcpr, lc_pcpo; \
typedef typeof(pcp) pcp_op_T__; \
pcp_op_T__ val__ = (val); \
pcp_op_T__ old__, *ptr__; \
- preempt_disable_notrace(); \
- ptr__ = raw_cpu_ptr(&(pcp)); \
- asm volatile( \
- op " %[old__],%[val__],%[ptr__]" \
- : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \
- : [val__] "d" (val__) \
+ \
+ lc_pcpr = offsetof(struct lowcore, percpu_register); \
+ lc_pcpo = offsetof(struct lowcore, percpu_offset); \
+ ptr__ = PERCPU_PTR(&(pcp)); \
+ asm_inline volatile( \
+ MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+ AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]") \
+ op " %[old__],%[val__],0(%[ptr__])\n" \
+ MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]") \
+ : [old__] "=&d" (old__), \
+ [ptr__] "+&a" (ptr__), "+m" (*ptr__), \
+ "=m" (((struct lowcore *)0)->percpu_register) \
+ : [val__] "d" (val__), \
+ [disppcpr] "i" (lc_pcpr), \
+ [disppcpo] "i" (lc_pcpo), \
+ [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS), \
+ [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS), \
+ "m" (((struct lowcore *)0)->percpu_offset) \
: "cc"); \
- preempt_enable_notrace(); \
old__ + val__; \
})
--
2.51.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 5/7] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
2026-05-26 5:56 [PATCH v5 0/7] s390: Improve this_cpu operations Heiko Carstens
` (3 preceding siblings ...)
2026-05-26 5:56 ` [PATCH v5 4/7] s390/percpu: Use new percpu code section for arch_this_cpu_add_return() Heiko Carstens
@ 2026-05-26 5:57 ` Heiko Carstens
2026-06-02 11:35 ` Alexander Gordeev
2026-05-26 5:57 ` [PATCH v5 6/7] s390/percpu: Provide arch_this_cpu_read() implementation Heiko Carstens
2026-05-26 5:57 ` [PATCH v5 7/7] s390/percpu: Provide arch_this_cpu_write() implementation Heiko Carstens
6 siblings, 1 reply; 22+ messages in thread
From: Heiko Carstens @ 2026-05-26 5:57 UTC (permalink / raw)
To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
Christian Borntraeger, Juergen Christ
Cc: linux-kernel, linux-s390
Convert arch_this_cpu_[and|or]() to make use of the new percpu code
section infrastructure.
There is no user of this_cpu_and() and only one user of this_cpu_or()
within the kernel. Therefore this conversion has hardly any effect,
and also removes only preempt_schedule_notrace() function call.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
arch/s390/include/asm/percpu.h | 26 +++++++++++++++++++-------
1 file changed, 19 insertions(+), 7 deletions(-)
diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index f2d0e0354582..5e0185e5960b 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -216,17 +216,29 @@ do { \
#define arch_this_cpu_to_op(pcp, val, op) \
do { \
+ unsigned long lc_pcpr, lc_pcpo; \
typedef typeof(pcp) pcp_op_T__; \
pcp_op_T__ val__ = (val); \
pcp_op_T__ old__, *ptr__; \
- preempt_disable_notrace(); \
- ptr__ = raw_cpu_ptr(&(pcp)); \
- asm volatile( \
- op " %[old__],%[val__],%[ptr__]" \
- : [old__] "=d" (old__), [ptr__] "+Q" (*ptr__) \
- : [val__] "d" (val__) \
+ \
+ lc_pcpr = offsetof(struct lowcore, percpu_register); \
+ lc_pcpo = offsetof(struct lowcore, percpu_offset); \
+ ptr__ = PERCPU_PTR(&(pcp)); \
+ asm_inline volatile( \
+ MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+ AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]") \
+ op " %[old__],%[val__],0(%[ptr__])\n" \
+ MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]") \
+ : [old__] "=&d" (old__), \
+ [ptr__] "+&a" (ptr__), "+m" (*ptr__), \
+ "=m" (((struct lowcore *)0)->percpu_register) \
+ : [val__] "d" (val__), \
+ [disppcpr] "i" (lc_pcpr), \
+ [disppcpo] "i" (lc_pcpo), \
+ [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS), \
+ [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS), \
+ "m" (((struct lowcore *)0)->percpu_offset) \
: "cc"); \
- preempt_enable_notrace(); \
} while (0)
#define this_cpu_and_4(pcp, val) arch_this_cpu_to_op(pcp, val, "lan")
--
2.51.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 6/7] s390/percpu: Provide arch_this_cpu_read() implementation
2026-05-26 5:56 [PATCH v5 0/7] s390: Improve this_cpu operations Heiko Carstens
` (4 preceding siblings ...)
2026-05-26 5:57 ` [PATCH v5 5/7] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]() Heiko Carstens
@ 2026-05-26 5:57 ` Heiko Carstens
2026-06-02 11:40 ` Alexander Gordeev
2026-05-26 5:57 ` [PATCH v5 7/7] s390/percpu: Provide arch_this_cpu_write() implementation Heiko Carstens
6 siblings, 1 reply; 22+ messages in thread
From: Heiko Carstens @ 2026-05-26 5:57 UTC (permalink / raw)
To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
Christian Borntraeger, Juergen Christ
Cc: linux-kernel, linux-s390
Provide an s390 specific implementation of arch_this_cpu_read() instead
of the generic variant. The generic variant uses preempt_disable() /
preempt_enable() pair and READ_ONCE().
Get rid of the preempt_disable() / preempt_enable() pairs by providing an
own variant which makes use of the new percpu code section infrastructure.
With this the text size of the kernel image is reduced by ~1k
(defconfig). Also 87 generated preempt_schedule_notrace() function
calls within the kernel image (modules not counted) are removed.
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
arch/s390/include/asm/percpu.h | 31 +++++++++++++++++++++++++++++++
1 file changed, 31 insertions(+)
diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 5e0185e5960b..83195a5dc409 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -248,6 +248,37 @@ do { \
#endif /* MARCH_HAS_Z196_FEATURES */
+#define arch_this_cpu_read(pcp, op) \
+({ \
+ unsigned long lc_pcpr, lc_pcpo, res__; \
+ typedef typeof(pcp) pcp_op_T__; \
+ pcp_op_T__ *ptr__; \
+ \
+ lc_pcpr = offsetof(struct lowcore, percpu_register); \
+ lc_pcpo = offsetof(struct lowcore, percpu_offset); \
+ ptr__ = PERCPU_PTR(&(pcp)); \
+ asm_inline volatile( \
+ MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+ AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]") \
+ op " %[res__],0(%[ptr__])\n" \
+ MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]") \
+ : [res__] "=&d" (res__), [ptr__] "+&a" (ptr__), \
+ "=m" (((struct lowcore *)0)->percpu_register) \
+ : [disppcpr] "i" (lc_pcpr), \
+ [disppcpo] "i" (lc_pcpo), \
+ [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS), \
+ [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS), \
+ "m" (*ptr__), \
+ "m" (((struct lowcore *)0)->percpu_offset) \
+ : "cc"); \
+ (pcp_op_T__)res__; \
+})
+
+#define this_cpu_read_1(pcp) arch_this_cpu_read(pcp, "llgc")
+#define this_cpu_read_2(pcp) arch_this_cpu_read(pcp, "llgh")
+#define this_cpu_read_4(pcp) arch_this_cpu_read(pcp, "llgf")
+#define this_cpu_read_8(pcp) arch_this_cpu_read(pcp, "lg")
+
#define arch_this_cpu_cmpxchg(pcp, oval, nval) \
({ \
typedef typeof(pcp) pcp_op_T__; \
--
2.51.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v5 7/7] s390/percpu: Provide arch_this_cpu_write() implementation
2026-05-26 5:56 [PATCH v5 0/7] s390: Improve this_cpu operations Heiko Carstens
` (5 preceding siblings ...)
2026-05-26 5:57 ` [PATCH v5 6/7] s390/percpu: Provide arch_this_cpu_read() implementation Heiko Carstens
@ 2026-05-26 5:57 ` Heiko Carstens
2026-06-02 11:40 ` Alexander Gordeev
6 siblings, 1 reply; 22+ messages in thread
From: Heiko Carstens @ 2026-05-26 5:57 UTC (permalink / raw)
To: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
Christian Borntraeger, Juergen Christ
Cc: linux-kernel, linux-s390
Provide an s390 specific implementation of arch_this_cpu_write()
instead of the generic variant. The generic variant uses a quite
expensive raw_local_irq_save() / raw_local_irq_restore() pair.
Get rid of this by providing an own variant which makes use of the new
percpu code section infrastructure.
With this the text size of the kernel image is reduced by ~1k (defconfig).
Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
---
arch/s390/include/asm/percpu.h | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
index 83195a5dc409..1d955dd0defa 100644
--- a/arch/s390/include/asm/percpu.h
+++ b/arch/s390/include/asm/percpu.h
@@ -279,6 +279,36 @@ do { \
#define this_cpu_read_4(pcp) arch_this_cpu_read(pcp, "llgf")
#define this_cpu_read_8(pcp) arch_this_cpu_read(pcp, "lg")
+#define arch_this_cpu_write(pcp, val, op) \
+do { \
+ unsigned long lc_pcpr, lc_pcpo; \
+ typedef typeof(pcp) pcp_op_T__; \
+ pcp_op_T__ *ptr__, val__ = (val); \
+ \
+ lc_pcpr = offsetof(struct lowcore, percpu_register); \
+ lc_pcpo = offsetof(struct lowcore, percpu_offset); \
+ ptr__ = PERCPU_PTR(&(pcp)); \
+ asm_inline volatile( \
+ MVIY_PERCPU("%[disppcpr]", "%[dispaltpcpr]", "%[ptr__]")\
+ AG_ALT("%[disppcpo]", "%[dispaltpcpo]", "%[ptr__]") \
+ op " %[val__],0(%[ptr__])\n" \
+ MVIY_ALT("%[disppcpr]", "%[dispaltpcpr]") \
+ : [ptr__] "+&a" (ptr__), "=m" (*ptr__), \
+ "=m" (((struct lowcore *)0)->percpu_register) \
+ : [val__] "d" (val__), \
+ [disppcpr] "i" (lc_pcpr), \
+ [disppcpo] "i" (lc_pcpo), \
+ [dispaltpcpr] "i" (lc_pcpr + LOWCORE_ALT_ADDRESS), \
+ [dispaltpcpo] "i" (lc_pcpo + LOWCORE_ALT_ADDRESS), \
+ "m" (((struct lowcore *)0)->percpu_offset) \
+ : "cc"); \
+} while (0)
+
+#define this_cpu_write_1(pcp, val) arch_this_cpu_write(pcp, val, "stc")
+#define this_cpu_write_2(pcp, val) arch_this_cpu_write(pcp, val, "sth")
+#define this_cpu_write_4(pcp, val) arch_this_cpu_write(pcp, val, "st")
+#define this_cpu_write_8(pcp, val) arch_this_cpu_write(pcp, val, "stg")
+
#define arch_this_cpu_cmpxchg(pcp, oval, nval) \
({ \
typedef typeof(pcp) pcp_op_T__; \
--
2.51.0
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient this_cpu operations
2026-05-26 5:56 ` [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient " Heiko Carstens
@ 2026-06-01 7:10 ` Alexander Gordeev
2026-06-01 13:27 ` Heiko Carstens
2026-06-02 10:08 ` Alexander Gordeev
1 sibling, 1 reply; 22+ messages in thread
From: Alexander Gordeev @ 2026-06-01 7:10 UTC (permalink / raw)
To: Heiko Carstens
Cc: Sven Schnelle, Vasily Gorbik, Christian Borntraeger,
Juergen Christ, linux-kernel, linux-s390
On Tue, May 26, 2026 at 07:56:56AM +0200, Heiko Carstens wrote:
Hi Heiko,
Please find some nitpicks below.
> With the intended removal of PREEMPT_NONE this_cpu operations based on
> atomic instructions, guarded with preempt_disable()/preempt_enable() pairs
> become more expensive: the preempt_disable() / preempt_enable() pairs are
> not optimized away anymore during compile time.
>
> In particular the conditional call to preempt_schedule_notrace() after
> preempt_enable() adds additional code and register pressure.
>
> E.g. this simple C code sequence
>
> DEFINE_PER_CPU(long, foo);
> long bar(long a) { return this_cpu_add_return(foo, a); }
>
> generates this code:
>
> 11a976: eb af f0 68 00 24 stmg %r10,%r15,104(%r15)
> 11a97c: b9 04 00 ef lgr %r14,%r15
> 11a980: b9 04 00 b2 lgr %r11,%r2
> 11a984: e3 f0 ff c8 ff 71 lay %r15,-56(%r15)
> 11a98a: e3 e0 f0 98 00 24 stg %r14,152(%r15)
> 11a990: eb 01 03 a8 00 6a asi 936,1 <- __preempt_count_add(1)
> 11a996: c0 10 00 d2 ac b5 larl %r1,1b70300 <- address of percpu var
> 11a9a0: e3 10 23 b8 00 08 ag %r1,952 <- add percpu offset
> 11a9a6: eb ab 10 00 00 e8 laag %r10,%r11,0(%r1) <- atomic op
> 11a9ac: eb ff 03 a8 00 6e alsi 936,-1 <- __preempt_count_dec_and_test()
> 11a9b2: a7 54 00 05 jnhe 11a9bc <bar+0x4c>
> 11a9b6: c0 e5 00 76 d1 bd brasl %r14,ff4d30 <preempt_schedule_notrace>
> 11a9bc: b9 e8 b0 2a agrk %r2,%r10,%r11
> 11a9c0: eb af f0 a0 00 04 lmg %r10,%r15,160(%r15)
> 11a9c6 07 fe br %r14
>
> Even though the above example is more or less the worst case, since the
> branch to preempt_schedule_notrace() requires a stackframe, which
> otherwise wouldn't be necessary, there is also the conditional jnhe branch
> instruction.
>
> Get rid of the conditional branch with the following code sequence:
>
> 11a8e6: c0 30 00 d0 c5 0d larl %r3,1b33300
Annotating with similar comments as above would help:
... <- address of percpu var
... <- add percpu offset
> 11a8ec: b9 04 00 43 lgr %r4,%r3
> 11a8f0: eb 00 43 c0 00 52 mviy 960,4
> 11a8f6: e3 40 03 b8 00 08 ag %r4,952
> 11a8fc: eb 52 40 00 00 e8 laag %r5,%r2,0(%r4)
> 11a902: eb 00 03 c0 00 52 mviy 960,0
> 11a908: b9 08 00 25 agr %r2,%r5
> 11a90c 07 fe br %r14
>
> The general idea is that this_cpu operations based on atomic instructions
> are guarded with mviy instructions:
>
> - The first mviy instruction writes the register number, which contains
> the percpu address variable to lowcore. This also indicates that a
> percpu code section is executed.
>
> - The first instruction following the mviy instruction must be the ag
> instruction which adds the percpu offset to the percpu address register.
>
> - Afterwards the atomic percpu operation follows.
>
> - Then a second mviy instruction writes a zero to lowcore, which indicates
> the end of the percpu code section.
>
> - In case of an interrupt/exception/nmi the register number which was
> written to lowcore is copied to the exception frame (pt_regs), and a zero
> is written to lowcore.
>
> - On return to the previous context it is checked if a percpu code section
> was executed (saved register number not zero), and if the process was
> migrated to a different cpu. If the percpu offset was already added to
> the percpu address register (instruction address does _not_ point to the
> ag instruction) the content of the percpu address register is adjusted so
> it points to percpu variable of the new cpu.
>
> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
> ---
> arch/s390/include/asm/entry-percpu.h | 80 ++++++++++++++++++++++++++++
> arch/s390/include/asm/lowcore.h | 3 +-
> arch/s390/include/asm/percpu.h | 62 +++++++++++++++++++++
> arch/s390/include/asm/ptrace.h | 2 +
> arch/s390/kernel/irq.c | 24 ++++++---
> arch/s390/kernel/nmi.c | 5 ++
> arch/s390/kernel/traps.c | 5 ++
> 7 files changed, 173 insertions(+), 8 deletions(-)
> create mode 100644 arch/s390/include/asm/entry-percpu.h
>
> diff --git a/arch/s390/include/asm/entry-percpu.h b/arch/s390/include/asm/entry-percpu.h
> new file mode 100644
> index 000000000000..4ef47265cfce
> --- /dev/null
> +++ b/arch/s390/include/asm/entry-percpu.h
> @@ -0,0 +1,80 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef ARCH_S390_ENTRY_PERCPU_H
> +#define ARCH_S390_ENTRY_PERCPU_H
> +
> +#include <linux/kprobes.h>
> +#include <linux/percpu.h>
> +#include <asm/lowcore.h>
> +#include <asm/ptrace.h>
> +#include <asm/asm-offsets.h>
> +
> +static __always_inline void percpu_entry(struct pt_regs *regs)
> +{
> + struct lowcore *lc = get_lowcore();
> +
> + if (user_mode(regs))
> + return;
> + regs->cpu = lc->cpu_nr;
> + regs->percpu_register = lc->percpu_register;
> + lc->percpu_register = 0;
This assignment is required when the task is migrated to a new CPU, so
the lowcore would not be corrupted with the stale percpu_register value.
Otherwise, it is restored in percpu_exit() if there was no migration.
Right?
> +}
> +
> +static __always_inline bool percpu_code_check(struct pt_regs *regs)
> +{
> + unsigned int insn, disp;
> + struct kprobe *p;
> +
> + if (likely(user_mode(regs) || !regs->percpu_register))
> + return false;
> + /*
> + * Within a percpu code section - check if the percpu base register
> + * needs to be updated. This is the case if the PSW does not point to
> + * the ADD instruction within the section.
> + * - AG %rx,percpu_offset_in_lowcore(%r0,%r0)
> + * which adds the percpu offset to the percpu base register.
> + */
> + lockdep_assert_preemption_disabled();
> +again:
> + insn = READ_ONCE(*(u16 *)psw_bits(regs->psw).ia);
> + if (unlikely(insn == BREAKPOINT_INSTRUCTION)) {
> + p = get_kprobe((void *)psw_bits(regs->psw).ia);
> + /*
> + * If the kprobe is concurrently removed on a different CPU
> + * it might not be found anymore. However text must have
> + * been restored - try again.
> + */
> + if (!p)
> + goto again;
> + insn = p->opcode;
> + }
> + if ((insn & 0xff0f) != 0xe300)
Any reason mask/not to check the register number as well?
> + return true;
> + disp = offsetof(struct lowcore, percpu_offset);
> + if (machine_has_relocated_lowcore())
> + disp += LOWCORE_ALT_ADDRESS;
> + insn = (disp & 0xff000) >> 4 | (disp & 0x00fff) << 16 | 0x8;
> + if (*(u32 *)(psw_bits(regs->psw).ia + 2) != insn)
> + return true;
> + return false;
> +}
> +
> +static __always_inline void percpu_exit(struct pt_regs *regs, bool needs_fixup)
> +{
> + struct lowcore *lc = get_lowcore();
> + unsigned char reg;
> +
> + if (user_mode(regs))
> + return;
> + reg = regs->percpu_register;
> + lc->percpu_register = reg;
> + if (likely(!needs_fixup))
> + return;
> + /* Check if process has been migrated to a different CPU. */
> + if (regs->cpu == lc->cpu_nr)
> + return;
> + /* Fixup percpu base register */
> + regs->gprs[reg] -= __per_cpu_offset[regs->cpu];
> + regs->gprs[reg] += lc->percpu_offset;
Such one reads bit better:
regs->gprs[reg] -= __per_cpu_offset[regs->cpu];
regs->gprs[reg] += __per_cpu_offset[reg];
> +}
> +
> +#endif
> diff --git a/arch/s390/include/asm/lowcore.h b/arch/s390/include/asm/lowcore.h
> index 50ffe75adeb4..cd1ddfdb5d35 100644
> --- a/arch/s390/include/asm/lowcore.h
> +++ b/arch/s390/include/asm/lowcore.h
> @@ -165,7 +165,8 @@ struct lowcore {
> __u32 spinlock_index; /* 0x03b0 */
> __u8 pad_0x03b4[0x03b8-0x03b4]; /* 0x03b4 */
> __u64 percpu_offset; /* 0x03b8 */
> - __u8 pad_0x03c0[0x0400-0x03c0]; /* 0x03c0 */
> + __u8 percpu_register; /* 0x03c0 */
> + __u8 pad_0x03c1[0x0400-0x03c1]; /* 0x03c1 */
>
> __u32 return_lpswe; /* 0x0400 */
> __u32 return_mcck_lpswe; /* 0x0404 */
> diff --git a/arch/s390/include/asm/percpu.h b/arch/s390/include/asm/percpu.h
> index b18a96f3a334..78602d2f5eba 100644
> --- a/arch/s390/include/asm/percpu.h
> +++ b/arch/s390/include/asm/percpu.h
> @@ -60,6 +60,68 @@
> #define this_cpu_or_1(pcp, val) arch_this_cpu_to_op_simple(pcp, val, |)
> #define this_cpu_or_2(pcp, val) arch_this_cpu_to_op_simple(pcp, val, |)
>
> +/*
> + * Macros to be used for percpu code section based on atomic instructions.
> + *
> + * Avoid the need to use preempt_disable() / preempt_disable() pairs and the
> + * conditional preempt_schedule_notrace() function calls which come with
> + * this. The idea is that this_cpu operations based on atomic instructions are
> + * guarded with mviy instructions:
> + *
> + * - The first mviy instruction writes the register number, which contains the
> + * percpu address variable to lowcore. This also indicates that a percpu
> + * code section is executed.
> + *
> + * - The first mviy instruction following the mviy instruction must be the ag
> + * instruction which adds the percpu offset to the percpu address register.
> + *
> + * - Afterwards the atomic percpu operation follows.
> + *
> + * - Then a second mviy instruction writes a zero to lowcore, which indicates
> + * the end of the percpu code section.
> + *
> + * - In case of an interrupt/exception/nmi the register number which was
> + * written to lowcore is copied to the exception frame (pt_regs), and a zero
> + * is written to lowcore.
> + *
> + * - On return to the previous context it is checked if a percpu code section
> + * was executed (saved register number not zero), and if the process was
> + * migrated to a different cpu. If the percpu offset was already added to
> + * the percpu address register (instruction address does _not_ point to the
> + * ag instruction) the content of the percpu address register is adjusted so
> + * it points to percpu variable of the new cpu.
> + *
> + * Inline assemblies making use of this typically have a code sequence like:
> + *
> + * MVIY_PERCPU(...) <- start of percpu code section
> + * AG_ALT(...) <- add percpu offset; must be the second instruction
> + * atomic_op <- atomic op
\t here, but should be spaces?
Probably it worth noting that no extra instructions between atomic_op
and MVIY_ALT may exist (e.g. in the future), especially ones that use
the percpu_register.
> + * MVIY_ALT(...) <- end of percpu code section
> + */
> +
> +#define MVIY_PERCPU(disp, dispalt, reg) \
> + ".macro GEN_MVIY disp reg\n" \
> + ".irp rs,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15\n" \
> + " .ifc \\reg,%%r\\rs\n" \
> + " mviy \\disp(%%r0),\\rs\n" \
> + " .endif\n" \
> + ".endr\n" \
> + ".endm\n" \
> + ALTERNATIVE("GEN_MVIY " __stringify(disp) " " __stringify(reg) "\n", \
> + "GEN_MVIY " __stringify(dispalt) " " __stringify(reg) "\n", \
> + ALT_FEATURE(MFEATURE_LOWCORE)) \
> + ".purgem GEN_MVIY\n"
> +
> +#define MVIY_ALT(disp, dispalt) \
> + ALTERNATIVE(" mviy " disp "(%%r0),0\n", \
> + " mviy " dispalt "(%%r0),0\n", \
> + ALT_FEATURE(MFEATURE_LOWCORE))
> +
> +#define AG_ALT(disp, dispalt, reg) \
> + ALTERNATIVE(" ag " reg ", " disp "(%%r0)\n", \
> + " ag " reg ", " dispalt "(%%r0)\n", \
> + ALT_FEATURE(MFEATURE_LOWCORE))
> +
> #ifndef MARCH_HAS_Z196_FEATURES
>
> #define this_cpu_add_4(pcp, val) arch_this_cpu_to_op_simple(pcp, val, +)
> diff --git a/arch/s390/include/asm/ptrace.h b/arch/s390/include/asm/ptrace.h
> index aaceb1d9110a..495e310c3d6d 100644
> --- a/arch/s390/include/asm/ptrace.h
> +++ b/arch/s390/include/asm/ptrace.h
> @@ -134,6 +134,8 @@ struct pt_regs {
> };
> unsigned long flags;
> unsigned long last_break;
> + unsigned int cpu;
> + unsigned char percpu_register;
May be name it as this_cpu and this_cpu_reg(ister)?
> };
>
> /*
> diff --git a/arch/s390/kernel/irq.c b/arch/s390/kernel/irq.c
> index d10a17e6531d..efb988833c88 100644
> --- a/arch/s390/kernel/irq.c
> +++ b/arch/s390/kernel/irq.c
> @@ -33,6 +33,7 @@
> #include <asm/softirq_stack.h>
> #include <asm/vtime.h>
> #include <asm/asm.h>
> +#include <asm/entry-percpu.h>
> #include "entry.h"
>
> DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_stat, irq_stat);
> @@ -142,10 +143,13 @@ static int irq_pending(struct pt_regs *regs)
>
> void noinstr do_io_irq(struct pt_regs *regs)
> {
> - irqentry_state_t state = irqentry_enter(regs);
> - struct pt_regs *old_regs = set_irq_regs(regs);
> - bool from_idle;
> + bool from_idle, percpu_needs_fixup;
> + struct pt_regs *old_regs;
> + irqentry_state_t state;
>
> + percpu_entry(regs);
> + state = irqentry_enter(regs);
> + old_regs = set_irq_regs(regs);
> from_idle = test_and_clear_cpu_flag(CIF_ENABLED_WAIT);
> if (from_idle)
> update_timer_idle();
> @@ -170,21 +174,25 @@ void noinstr do_io_irq(struct pt_regs *regs)
> do_irq_async(regs, IO_INTERRUPT);
> } while (machine_is_lpar() && irq_pending(regs));
>
> + percpu_needs_fixup = percpu_code_check(regs);
> irq_exit_rcu();
> -
> set_irq_regs(old_regs);
> irqentry_exit(regs, state);
>
> if (from_idle)
> regs->psw.mask &= ~(PSW_MASK_EXT | PSW_MASK_IO | PSW_MASK_WAIT);
> + percpu_exit(regs, percpu_needs_fixup);
> }
>
> void noinstr do_ext_irq(struct pt_regs *regs)
> {
> - irqentry_state_t state = irqentry_enter(regs);
> - struct pt_regs *old_regs = set_irq_regs(regs);
> - bool from_idle;
> + bool from_idle, percpu_needs_fixup;
> + struct pt_regs *old_regs;
> + irqentry_state_t state;
>
> + percpu_entry(regs);
> + state = irqentry_enter(regs);
> + old_regs = set_irq_regs(regs);
> from_idle = test_and_clear_cpu_flag(CIF_ENABLED_WAIT);
> if (from_idle)
> update_timer_idle();
> @@ -206,12 +214,14 @@ void noinstr do_ext_irq(struct pt_regs *regs)
>
> do_irq_async(regs, EXT_INTERRUPT);
>
> + percpu_needs_fixup = percpu_code_check(regs);
> irq_exit_rcu();
> set_irq_regs(old_regs);
> irqentry_exit(regs, state);
>
> if (from_idle)
> regs->psw.mask &= ~(PSW_MASK_EXT | PSW_MASK_IO | PSW_MASK_WAIT);
> + percpu_exit(regs, percpu_needs_fixup);
> }
>
> static void show_msi_interrupt(struct seq_file *p, int irq)
> diff --git a/arch/s390/kernel/nmi.c b/arch/s390/kernel/nmi.c
> index 94fbfad49f62..e17a59d4d5a4 100644
> --- a/arch/s390/kernel/nmi.c
> +++ b/arch/s390/kernel/nmi.c
> @@ -22,6 +22,7 @@
> #include <linux/module.h>
> #include <linux/sched/signal.h>
> #include <linux/kvm_host.h>
> +#include <asm/entry-percpu.h>
> #include <asm/lowcore.h>
> #include <asm/ctlreg.h>
> #include <asm/fpu.h>
> @@ -363,6 +364,7 @@ NOKPROBE_SYMBOL(s390_backup_mcck_info);
> */
> void notrace s390_do_machine_check(struct pt_regs *regs)
> {
> + bool percpu_needs_fixup;
> static int ipd_count;
> static DEFINE_SPINLOCK(ipd_lock);
> static unsigned long long last_ipd;
> @@ -374,6 +376,7 @@ void notrace s390_do_machine_check(struct pt_regs *regs)
> unsigned long mcck_dam_code;
> int mcck_pending = 0;
>
> + percpu_entry(regs);
> irq_state = irqentry_nmi_enter(regs);
>
> if (user_mode(regs))
> @@ -495,7 +498,9 @@ void notrace s390_do_machine_check(struct pt_regs *regs)
> if (mcck_pending)
> schedule_mcck_handler();
>
> + percpu_needs_fixup = percpu_code_check(regs);
> irqentry_nmi_exit(regs, irq_state);
> + percpu_exit(regs, percpu_needs_fixup);
> }
> NOKPROBE_SYMBOL(s390_do_machine_check);
>
> diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c
> index 1b5c6fc431cc..564403496a7c 100644
> --- a/arch/s390/kernel/traps.c
> +++ b/arch/s390/kernel/traps.c
> @@ -24,6 +24,7 @@
> #include <linux/entry-common.h>
> #include <linux/kmsan.h>
> #include <linux/bug.h>
> +#include <asm/entry-percpu.h>
> #include <asm/asm-extable.h>
> #include <asm/irqflags.h>
> #include <asm/ptrace.h>
> @@ -329,6 +330,7 @@ static void (*pgm_check_table[128])(struct pt_regs *regs);
> void noinstr __do_pgm_check(struct pt_regs *regs)
> {
> struct lowcore *lc = get_lowcore();
> + bool percpu_needs_fixup;
> irqentry_state_t state;
> unsigned int trapnr;
> union teid teid;
> @@ -349,6 +351,7 @@ void noinstr __do_pgm_check(struct pt_regs *regs)
> current->thread.gmap_int_code = regs->int_code & 0xffff;
> return;
> }
> + percpu_entry(regs);
> state = irqentry_enter(regs);
> if (user_mode(regs)) {
> update_timer_sys();
> @@ -385,7 +388,9 @@ void noinstr __do_pgm_check(struct pt_regs *regs)
> pgm_check_table[trapnr](regs);
> out:
> local_irq_disable();
> + percpu_needs_fixup = percpu_code_check(regs);
> irqentry_exit(regs, state);
> + percpu_exit(regs, percpu_needs_fixup);
> }
>
> /*
> --
> 2.51.0
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 2/7] s390/percpu: Add missing do { } while (0) constructs
2026-05-26 5:56 ` [PATCH v5 2/7] s390/percpu: Add missing do { } while (0) constructs Heiko Carstens
@ 2026-06-01 7:14 ` Alexander Gordeev
0 siblings, 0 replies; 22+ messages in thread
From: Alexander Gordeev @ 2026-06-01 7:14 UTC (permalink / raw)
To: Heiko Carstens
Cc: Sven Schnelle, Vasily Gorbik, Christian Borntraeger,
Juergen Christ, linux-kernel, linux-s390
On Tue, May 26, 2026 at 07:56:57AM +0200, Heiko Carstens wrote:
> Add missing do { } while (0) constructs in order to avoid potential
> build failures.
>
> Reported-by: Sashiko <sashiko-bot@kernel.org>
> Closes: https://sashiko.dev/#/patchset/20260319120503.4046659-1-hca%40linux.ibm.com
> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
> ---
> arch/s390/include/asm/percpu.h | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient this_cpu operations
2026-06-01 7:10 ` Alexander Gordeev
@ 2026-06-01 13:27 ` Heiko Carstens
2026-06-01 14:49 ` Alexander Gordeev
0 siblings, 1 reply; 22+ messages in thread
From: Heiko Carstens @ 2026-06-01 13:27 UTC (permalink / raw)
To: Alexander Gordeev
Cc: Sven Schnelle, Vasily Gorbik, Christian Borntraeger,
Juergen Christ, linux-kernel, linux-s390
On Mon, Jun 01, 2026 at 09:10:47AM +0200, Alexander Gordeev wrote:
> On Tue, May 26, 2026 at 07:56:56AM +0200, Heiko Carstens wrote:
> > Get rid of the conditional branch with the following code sequence:
> >
> > 11a8e6: c0 30 00 d0 c5 0d larl %r3,1b33300
>
> Annotating with similar comments as above would help:
>
> ... <- address of percpu var
> ... <- add percpu offset
>
> > 11a8ec: b9 04 00 43 lgr %r4,%r3
> > 11a8f0: eb 00 43 c0 00 52 mviy 960,4
> > 11a8f6: e3 40 03 b8 00 08 ag %r4,952
> > 11a8fc: eb 52 40 00 00 e8 laag %r5,%r2,0(%r4)
> > 11a902: eb 00 03 c0 00 52 mviy 960,0
> > 11a908: b9 08 00 25 agr %r2,%r5
> > 11a90c 07 fe br %r14
Sure.
> > +static __always_inline void percpu_entry(struct pt_regs *regs)
> > +{
> > + struct lowcore *lc = get_lowcore();
> > +
> > + if (user_mode(regs))
> > + return;
> > + regs->cpu = lc->cpu_nr;
> > + regs->percpu_register = lc->percpu_register;
> > + lc->percpu_register = 0;
>
> This assignment is required when the task is migrated to a new CPU, so
> the lowcore would not be corrupted with the stale percpu_register value.
> Otherwise, it is restored in percpu_exit() if there was no migration.
> Right?
This, but it is also required to have a "clean" starting point for the
new context (exception, irq, nmi), since the new context does not run
in a percpu code section - if that context would be interrupted then
the next context might see that the previous context was running in a
percpu code section, while it was not. Therefore lc->percpu_register
must be saved and set to 0 before the new context may use percpu ops
and/or fault.
> > + if ((insn & 0xff0f) != 0xe300)
>
> Any reason mask/not to check the register number as well?
Just to save code. If there would be a mismatch with the percpu
register, something serious would be wrong..
But that cannot happen(tm) :)
> > + /* Check if process has been migrated to a different CPU. */
> > + if (regs->cpu == lc->cpu_nr)
> > + return;
> > + /* Fixup percpu base register */
> > + regs->gprs[reg] -= __per_cpu_offset[regs->cpu];
> > + regs->gprs[reg] += lc->percpu_offset;
>
> Such one reads bit better:
>
> regs->gprs[reg] -= __per_cpu_offset[regs->cpu];
> regs->gprs[reg] += __per_cpu_offset[reg];
That reads better, but it is wrong. 'reg' is not the index of the
current cpu into __per_cpu_offset[].
It could be changed to
regs->gprs[reg] += __per_cpu_offset[lc->cpu_nr];
but then again that's exactly lc->percpu_offset. I don't see a reason
to create worse code here.
> > + * Inline assemblies making use of this typically have a code sequence like:
> > + *
> > + * MVIY_PERCPU(...) <- start of percpu code section
> > + * AG_ALT(...) <- add percpu offset; must be the second instruction
> > + * atomic_op <- atomic op
> \t here, but should be spaces?
I can't follow? We have tabs in comments all over the place in s390 code.
> Probably it worth noting that no extra instructions between atomic_op
> and MVIY_ALT may exist (e.g. in the future), especially ones that use
> the percpu_register.
That's not true. There may be additional instructions, they may just
not use the same register that is used for the percpu variable.
But that was true before this patch as well, even though it might not
have been "obvious".
> > unsigned long flags;
> > unsigned long last_break;
> > + unsigned int cpu;
> > + unsigned char percpu_register;
>
> May be name it as this_cpu and this_cpu_reg(ister)?
Here I disagree. *If* a task would be migrated then such naming would
be very confusing: regs->this_cpu would contain the cpu number of the
_previous_ cpu where the task was running on, but not the number of
the current cpu.
So you would end up with checks like:
if (regs->this_cpu != this_cpu)
to figure out if a task was migrated. This doesn't look right to me.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient this_cpu operations
2026-06-01 13:27 ` Heiko Carstens
@ 2026-06-01 14:49 ` Alexander Gordeev
2026-06-01 15:08 ` Heiko Carstens
0 siblings, 1 reply; 22+ messages in thread
From: Alexander Gordeev @ 2026-06-01 14:49 UTC (permalink / raw)
To: Heiko Carstens
Cc: Sven Schnelle, Vasily Gorbik, Christian Borntraeger,
Juergen Christ, linux-kernel, linux-s390
On Mon, Jun 01, 2026 at 03:27:50PM +0200, Heiko Carstens wrote:
...
> > > + if ((insn & 0xff0f) != 0xe300)
> >
> > Any reason mask/not to check the register number as well?
>
> Just to save code. If there would be a mismatch with the percpu
> register, something serious would be wrong..
Is it worth checking the disp then? ;)
> But that cannot happen(tm) :)
...
> > > + /* Check if process has been migrated to a different CPU. */
> > > + if (regs->cpu == lc->cpu_nr)
> > > + return;
> > > + /* Fixup percpu base register */
> > > + regs->gprs[reg] -= __per_cpu_offset[regs->cpu];
> > > + regs->gprs[reg] += lc->percpu_offset;
> >
> > Such one reads bit better:
> >
> > regs->gprs[reg] -= __per_cpu_offset[regs->cpu];
> > regs->gprs[reg] += __per_cpu_offset[reg];
>
> That reads better, but it is wrong. 'reg' is not the index of the
> current cpu into __per_cpu_offset[].
>
> It could be changed to
>
> regs->gprs[reg] += __per_cpu_offset[lc->cpu_nr];
>
> but then again that's exactly lc->percpu_offset. I don't see a reason
> to create worse code here.
>
> > > + * Inline assemblies making use of this typically have a code sequence like:
> > > + *
> > > + * MVIY_PERCPU(...) <- start of percpu code section
> > > + * AG_ALT(...) <- add percpu offset; must be the second instruction
> > > + * atomic_op <- atomic op
> > \t here, but should be spaces?
>
> I can't follow? We have tabs in comments all over the place in s390 code.
The other '<-' comments below and above use spaces, but this one
mixes spaces with '\t'.
> > Probably it worth noting that no extra instructions between atomic_op
> > and MVIY_ALT may exist (e.g. in the future), especially ones that use
> > the percpu_register.
>
> That's not true. There may be additional instructions, they may just
> not use the same register that is used for the percpu variable.
That is what I tried to say, but I also thought it is intentionally
only two instructions between mviy brackets allowed: ag + atomic_op.
Am I wrong here?
> But that was true before this patch as well, even though it might not
> have been "obvious".
Hmm.. I do not get it. We would not get scheduled away before this patch,
aren't we?
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient this_cpu operations
2026-06-01 14:49 ` Alexander Gordeev
@ 2026-06-01 15:08 ` Heiko Carstens
2026-06-02 13:32 ` David Laight
0 siblings, 1 reply; 22+ messages in thread
From: Heiko Carstens @ 2026-06-01 15:08 UTC (permalink / raw)
To: Alexander Gordeev
Cc: Sven Schnelle, Vasily Gorbik, Christian Borntraeger,
Juergen Christ, linux-kernel, linux-s390
On Mon, Jun 01, 2026 at 04:49:35PM +0200, Alexander Gordeev wrote:
> On Mon, Jun 01, 2026 at 03:27:50PM +0200, Heiko Carstens wrote:
> ...
> > > > + if ((insn & 0xff0f) != 0xe300)
> > >
> > > Any reason mask/not to check the register number as well?
> >
> > Just to save code. If there would be a mismatch with the percpu
> > register, something serious would be wrong..
>
> Is it worth checking the disp then? ;)
It is: the check makes sure this is an AG instruction, which adds the
percpu offset from lowcore - by checking that the displacement is
correct, as well as that the base register is zero.
There could be a different AG instruction within the inline assembly,
for whatever reason.
> > > > + * Inline assemblies making use of this typically have a code sequence like:
> > > > + *
> > > > + * MVIY_PERCPU(...) <- start of percpu code section
> > > > + * AG_ALT(...) <- add percpu offset; must be the second instruction
> > > > + * atomic_op <- atomic op
> > > \t here, but should be spaces?
> >
> > I can't follow? We have tabs in comments all over the place in s390 code.
>
> The other '<-' comments below and above use spaces, but this one
> mixes spaces with '\t'.
Because it is not possible to use tabs there. We put tabs in our
comments whenever possible.
> > > Probably it worth noting that no extra instructions between atomic_op
> > > and MVIY_ALT may exist (e.g. in the future), especially ones that use
> > > the percpu_register.
> >
> > That's not true. There may be additional instructions, they may just
> > not use the same register that is used for the percpu variable.
>
> That is what I tried to say, but I also thought it is intentionally
> only two instructions between mviy brackets allowed: ag + atomic_op.
> Am I wrong here?
Yes, you are wrong. You can use as many instructions as you want, as
long as the inline assembly follows the rules with respect to the
register which contains the percpu variable.
But seriously: all of this works only with atomic ops, so I don't see
reason why one put more instructions there. We actively try to keep
our inline assemblies as small as possible.
Even though with relocatable lowcore and alternatives the C code looks
huge, while it only generates very few instructions.
> > But that was true before this patch as well, even though it might not
> > have been "obvious".
>
> Hmm.. I do not get it. We would not get scheduled away before this patch,
> aren't we?
True. What I tried to say: before and after this patch there was and
is no code which _uses_ the pointer of the percpu variable more than
once. Accessing the percpu variable twice within such a section would
be potentially broken, since between two accesses an interrupt / nmi
could happen, which could modify the percpu var contents, which again
could result in loss of information.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient this_cpu operations
2026-05-26 5:56 ` [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient " Heiko Carstens
2026-06-01 7:10 ` Alexander Gordeev
@ 2026-06-02 10:08 ` Alexander Gordeev
1 sibling, 0 replies; 22+ messages in thread
From: Alexander Gordeev @ 2026-06-02 10:08 UTC (permalink / raw)
To: Heiko Carstens
Cc: Sven Schnelle, Vasily Gorbik, Christian Borntraeger,
Juergen Christ, linux-kernel, linux-s390
On Tue, May 26, 2026 at 07:56:56AM +0200, Heiko Carstens wrote:
> With the intended removal of PREEMPT_NONE this_cpu operations based on
> atomic instructions, guarded with preempt_disable()/preempt_enable() pairs
> become more expensive: the preempt_disable() / preempt_enable() pairs are
> not optimized away anymore during compile time.
>
> In particular the conditional call to preempt_schedule_notrace() after
> preempt_enable() adds additional code and register pressure.
...
>
> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
> ---
> arch/s390/include/asm/entry-percpu.h | 80 ++++++++++++++++++++++++++++
> arch/s390/include/asm/lowcore.h | 3 +-
> arch/s390/include/asm/percpu.h | 62 +++++++++++++++++++++
> arch/s390/include/asm/ptrace.h | 2 +
> arch/s390/kernel/irq.c | 24 ++++++---
> arch/s390/kernel/nmi.c | 5 ++
> arch/s390/kernel/traps.c | 5 ++
> 7 files changed, 173 insertions(+), 8 deletions(-)
> create mode 100644 arch/s390/include/asm/entry-percpu.h
Reviewed-by: Alexander Gordeev <agordeev@linux.ibm.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 3/7] s390/percpu: Use new percpu code section for arch_this_cpu_add()
2026-05-26 5:56 ` [PATCH v5 3/7] s390/percpu: Use new percpu code section for arch_this_cpu_add() Heiko Carstens
@ 2026-06-02 11:33 ` Alexander Gordeev
0 siblings, 0 replies; 22+ messages in thread
From: Alexander Gordeev @ 2026-06-02 11:33 UTC (permalink / raw)
To: Heiko Carstens
Cc: Sven Schnelle, Vasily Gorbik, Christian Borntraeger,
Juergen Christ, linux-kernel, linux-s390
On Tue, May 26, 2026 at 07:56:58AM +0200, Heiko Carstens wrote:
> Convert arch_this_cpu_add() to make use of the new percpu code section
> infrastructure.
>
> With this the text size of the kernel image is reduced by ~76kb
> (defconfig). Also more than 5300 generated preempt_schedule_notrace()
> function calls within the kernel image (modules not counted) are removed.
>
> With:
>
> DEFINE_PER_CPU(long, foo);
> void bar(long a) { this_cpu_add(foo, a); }
>
> Old arch_this_cpu_add() looks like this:
>
> 00000000000000c0 <bar>:
> c0: c0 04 00 00 00 00 jgnop c0 <bar>
> c6: eb 01 03 a8 00 6a asi 936,1
> cc: c4 18 00 00 00 00 lgrl %r1,cc <bar+0xc>
> ce: R_390_GOTENT foo+0x2
> d2: e3 10 03 b8 00 08 ag %r1,952
> d8: eb 22 10 00 00 e8 laag %r2,%r2,0(%r1)
> de: eb ff 03 a8 00 6e alsi 936,-1
> e4: a7 a4 00 05 jhe ee <bar+0x2e>
> e8: c0 f4 00 00 00 00 jg e8 <bar+0x28>
> ea: R_390_PC32DBL __s390_indirect_jump_r14+0x2
> ee: c0 f4 00 00 00 00 jg ee <bar+0x2e>
> f0: R_390_PLT32DBL preempt_schedule_notrace+0x2
>
> New arch_this_cpu_add() looks like this:
>
> 00000000000000c0 <bar>:
> c0: c0 04 00 00 00 00 jgnop c0 <bar>
> c6: c4 38 00 00 00 00 lgrl %r3,c6 <bar+0x6>
> c8: R_390_GOTENT foo+0x2
> cc: b9 04 00 43 lgr %r4,%r3
> d0: eb 00 43 c0 00 52 mviy 960(%r0),4
> d6: e3 40 03 b8 00 08 ag %r4,952
> dc: eb 52 40 00 00 e8 laag %r5,%r2,0(%r4)
> e2: eb 00 03 c0 00 52 mviy 960,0
> e8: c0 f4 00 00 00 00 jg e8 <bar+0x28>
> ea: R_390_PC32DBL __s390_indirect_jump_r14+0x2
>
> Note that the conditional function call is removed.
>
> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
> ---
> arch/s390/include/asm/percpu.h | 65 ++++++++++++++++++++++------------
> 1 file changed, 43 insertions(+), 22 deletions(-)
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 4/7] s390/percpu: Use new percpu code section for arch_this_cpu_add_return()
2026-05-26 5:56 ` [PATCH v5 4/7] s390/percpu: Use new percpu code section for arch_this_cpu_add_return() Heiko Carstens
@ 2026-06-02 11:34 ` Alexander Gordeev
0 siblings, 0 replies; 22+ messages in thread
From: Alexander Gordeev @ 2026-06-02 11:34 UTC (permalink / raw)
To: Heiko Carstens
Cc: Sven Schnelle, Vasily Gorbik, Christian Borntraeger,
Juergen Christ, linux-kernel, linux-s390
On Tue, May 26, 2026 at 07:56:59AM +0200, Heiko Carstens wrote:
> Convert arch_this_cpu_add_return() to make use of the new percpu code
> section infrastructure.
>
> With this the text size of the kernel image is reduced by ~4k
> (defconfig). Also 66 generated preempt_schedule_notrace() function
> calls within the kernel image (modules not counted) are removed.
>
> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
> ---
> arch/s390/include/asm/percpu.h | 26 +++++++++++++++++++-------
> 1 file changed, 19 insertions(+), 7 deletions(-)
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 5/7] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]()
2026-05-26 5:57 ` [PATCH v5 5/7] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]() Heiko Carstens
@ 2026-06-02 11:35 ` Alexander Gordeev
0 siblings, 0 replies; 22+ messages in thread
From: Alexander Gordeev @ 2026-06-02 11:35 UTC (permalink / raw)
To: Heiko Carstens
Cc: Sven Schnelle, Vasily Gorbik, Christian Borntraeger,
Juergen Christ, linux-kernel, linux-s390
On Tue, May 26, 2026 at 07:57:00AM +0200, Heiko Carstens wrote:
> Convert arch_this_cpu_[and|or]() to make use of the new percpu code
> section infrastructure.
>
> There is no user of this_cpu_and() and only one user of this_cpu_or()
> within the kernel. Therefore this conversion has hardly any effect,
> and also removes only preempt_schedule_notrace() function call.
>
> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
> ---
> arch/s390/include/asm/percpu.h | 26 +++++++++++++++++++-------
> 1 file changed, 19 insertions(+), 7 deletions(-)
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 6/7] s390/percpu: Provide arch_this_cpu_read() implementation
2026-05-26 5:57 ` [PATCH v5 6/7] s390/percpu: Provide arch_this_cpu_read() implementation Heiko Carstens
@ 2026-06-02 11:40 ` Alexander Gordeev
0 siblings, 0 replies; 22+ messages in thread
From: Alexander Gordeev @ 2026-06-02 11:40 UTC (permalink / raw)
To: Heiko Carstens
Cc: Sven Schnelle, Vasily Gorbik, Christian Borntraeger,
Juergen Christ, linux-kernel, linux-s390
On Tue, May 26, 2026 at 07:57:01AM +0200, Heiko Carstens wrote:
> Provide an s390 specific implementation of arch_this_cpu_read() instead
> of the generic variant. The generic variant uses preempt_disable() /
> preempt_enable() pair and READ_ONCE().
>
> Get rid of the preempt_disable() / preempt_enable() pairs by providing an
> own variant which makes use of the new percpu code section infrastructure.
>
> With this the text size of the kernel image is reduced by ~1k
> (defconfig). Also 87 generated preempt_schedule_notrace() function
> calls within the kernel image (modules not counted) are removed.
>
> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
> ---
> arch/s390/include/asm/percpu.h | 31 +++++++++++++++++++++++++++++++
> 1 file changed, 31 insertions(+)
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 7/7] s390/percpu: Provide arch_this_cpu_write() implementation
2026-05-26 5:57 ` [PATCH v5 7/7] s390/percpu: Provide arch_this_cpu_write() implementation Heiko Carstens
@ 2026-06-02 11:40 ` Alexander Gordeev
0 siblings, 0 replies; 22+ messages in thread
From: Alexander Gordeev @ 2026-06-02 11:40 UTC (permalink / raw)
To: Heiko Carstens
Cc: Sven Schnelle, Vasily Gorbik, Christian Borntraeger,
Juergen Christ, linux-kernel, linux-s390
On Tue, May 26, 2026 at 07:57:02AM +0200, Heiko Carstens wrote:
> Provide an s390 specific implementation of arch_this_cpu_write()
> instead of the generic variant. The generic variant uses a quite
> expensive raw_local_irq_save() / raw_local_irq_restore() pair.
>
> Get rid of this by providing an own variant which makes use of the new
> percpu code section infrastructure.
>
> With this the text size of the kernel image is reduced by ~1k (defconfig).
>
> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
> ---
> arch/s390/include/asm/percpu.h | 30 ++++++++++++++++++++++++++++++
> 1 file changed, 30 insertions(+)
Acked-by: Alexander Gordeev <agordeev@linux.ibm.com>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient this_cpu operations
2026-06-01 15:08 ` Heiko Carstens
@ 2026-06-02 13:32 ` David Laight
2026-06-02 13:54 ` Heiko Carstens
0 siblings, 1 reply; 22+ messages in thread
From: David Laight @ 2026-06-02 13:32 UTC (permalink / raw)
To: Heiko Carstens
Cc: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
Christian Borntraeger, Juergen Christ, linux-kernel, linux-s390
On Mon, 1 Jun 2026 17:08:13 +0200
Heiko Carstens <hca@linux.ibm.com> wrote:
> On Mon, Jun 01, 2026 at 04:49:35PM +0200, Alexander Gordeev wrote:
> > On Mon, Jun 01, 2026 at 03:27:50PM +0200, Heiko Carstens wrote:
> > ...
> > > > > + if ((insn & 0xff0f) != 0xe300)
> > > >
> > > > Any reason mask/not to check the register number as well?
> > >
> > > Just to save code. If there would be a mismatch with the percpu
> > > register, something serious would be wrong..
> >
> > Is it worth checking the disp then? ;)
>
> It is: the check makes sure this is an AG instruction, which adds the
> percpu offset from lowcore - by checking that the displacement is
> correct, as well as that the base register is zero.
>
> There could be a different AG instruction within the inline assembly,
> for whatever reason.
Do you actually even need to check the instruction?
This sequence can only work for simple per-cpu accesses, so I don't
see a reason to let the specified register point anywhere other than the
base of the per-cpu data.
That means the process switch code can just load the register with the
base of the per-cpu data for the new cpu.
If that happens before the 'AG' is executed it won't matter.
The only reason would be to support non-offsettable memory accesses.
But it looks like the 'laag %r5,%r2,0(%r4)' in the example has an
offset (of zero).
Probably only stops you doing a direct access of an array.
That would mean that needs_fixup goes in the bin and percpu_exit() becomes:
...
reg = regs->percpu_register;
if (likely(!reg))
return;
lc->percpu_register = reg;
regs->gprs[reg] = lc->percpu_offset
}
I guess I'm missing something?
-- David
>
> > > > > + * Inline assemblies making use of this typically have a code sequence like:
> > > > > + *
> > > > > + * MVIY_PERCPU(...) <- start of percpu code section
> > > > > + * AG_ALT(...) <- add percpu offset; must be the second instruction
> > > > > + * atomic_op <- atomic op
> > > > \t here, but should be spaces?
> > >
> > > I can't follow? We have tabs in comments all over the place in s390 code.
> >
> > The other '<-' comments below and above use spaces, but this one
> > mixes spaces with '\t'.
>
> Because it is not possible to use tabs there. We put tabs in our
> comments whenever possible.
>
> > > > Probably it worth noting that no extra instructions between atomic_op
> > > > and MVIY_ALT may exist (e.g. in the future), especially ones that use
> > > > the percpu_register.
> > >
> > > That's not true. There may be additional instructions, they may just
> > > not use the same register that is used for the percpu variable.
> >
> > That is what I tried to say, but I also thought it is intentionally
> > only two instructions between mviy brackets allowed: ag + atomic_op.
> > Am I wrong here?
>
> Yes, you are wrong. You can use as many instructions as you want, as
> long as the inline assembly follows the rules with respect to the
> register which contains the percpu variable.
>
> But seriously: all of this works only with atomic ops, so I don't see
> reason why one put more instructions there. We actively try to keep
> our inline assemblies as small as possible.
>
> Even though with relocatable lowcore and alternatives the C code looks
> huge, while it only generates very few instructions.
>
> > > But that was true before this patch as well, even though it might not
> > > have been "obvious".
> >
> > Hmm.. I do not get it. We would not get scheduled away before this patch,
> > aren't we?
>
> True. What I tried to say: before and after this patch there was and
> is no code which _uses_ the pointer of the percpu variable more than
> once. Accessing the percpu variable twice within such a section would
> be potentially broken, since between two accesses an interrupt / nmi
> could happen, which could modify the percpu var contents, which again
> could result in loss of information.
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient this_cpu operations
2026-06-02 13:32 ` David Laight
@ 2026-06-02 13:54 ` Heiko Carstens
2026-06-02 14:36 ` David Laight
0 siblings, 1 reply; 22+ messages in thread
From: Heiko Carstens @ 2026-06-02 13:54 UTC (permalink / raw)
To: David Laight
Cc: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
Christian Borntraeger, Juergen Christ, linux-kernel, linux-s390
On Tue, Jun 02, 2026 at 02:32:09PM +0100, David Laight wrote:
> On Mon, 1 Jun 2026 17:08:13 +0200
> Heiko Carstens <hca@linux.ibm.com> wrote:
> > It is: the check makes sure this is an AG instruction, which adds the
> > percpu offset from lowcore - by checking that the displacement is
> > correct, as well as that the base register is zero.
> >
> > There could be a different AG instruction within the inline assembly,
> > for whatever reason.
>
> Do you actually even need to check the instruction?
>
> This sequence can only work for simple per-cpu accesses, so I don't
> see a reason to let the specified register point anywhere other than the
> base of the per-cpu data.
>
> That means the process switch code can just load the register with the
> base of the per-cpu data for the new cpu.
> If that happens before the 'AG' is executed it won't matter.
>
> The only reason would be to support non-offsettable memory accesses.
> But it looks like the 'laag %r5,%r2,0(%r4)' in the example has an
> offset (of zero).
> Probably only stops you doing a direct access of an array.
>
> That would mean that needs_fixup goes in the bin and percpu_exit() becomes:
> ...
> reg = regs->percpu_register;
> if (likely(!reg))
> return;
> lc->percpu_register = reg;
> regs->gprs[reg] = lc->percpu_offset
> }
>
> I guess I'm missing something?
The percpu register (in the above example %r4) first contains the base address
of a percpu variable. To get the actual percpu address of the variable the
percpu_offset of the corresponding cpu has to be added to that address, which
is what the AG instruction is doing.
What you propose would make a CPU's percpu_offset the address of any percpu
variable, which most likely would result in a crash.
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient this_cpu operations
2026-06-02 13:54 ` Heiko Carstens
@ 2026-06-02 14:36 ` David Laight
0 siblings, 0 replies; 22+ messages in thread
From: David Laight @ 2026-06-02 14:36 UTC (permalink / raw)
To: Heiko Carstens
Cc: Alexander Gordeev, Sven Schnelle, Vasily Gorbik,
Christian Borntraeger, Juergen Christ, linux-kernel, linux-s390
On Tue, 2 Jun 2026 15:54:36 +0200
Heiko Carstens <hca@linux.ibm.com> wrote:
> On Tue, Jun 02, 2026 at 02:32:09PM +0100, David Laight wrote:
> > On Mon, 1 Jun 2026 17:08:13 +0200
> > Heiko Carstens <hca@linux.ibm.com> wrote:
> > > It is: the check makes sure this is an AG instruction, which adds the
> > > percpu offset from lowcore - by checking that the displacement is
> > > correct, as well as that the base register is zero.
> > >
> > > There could be a different AG instruction within the inline assembly,
> > > for whatever reason.
> >
> > Do you actually even need to check the instruction?
> >
> > This sequence can only work for simple per-cpu accesses, so I don't
> > see a reason to let the specified register point anywhere other than the
> > base of the per-cpu data.
> >
> > That means the process switch code can just load the register with the
> > base of the per-cpu data for the new cpu.
> > If that happens before the 'AG' is executed it won't matter.
> >
> > The only reason would be to support non-offsettable memory accesses.
> > But it looks like the 'laag %r5,%r2,0(%r4)' in the example has an
> > offset (of zero).
> > Probably only stops you doing a direct access of an array.
> >
> > That would mean that needs_fixup goes in the bin and percpu_exit() becomes:
> > ...
> > reg = regs->percpu_register;
> > if (likely(!reg))
> > return;
> > lc->percpu_register = reg;
> > regs->gprs[reg] = lc->percpu_offset
> > }
> >
> > I guess I'm missing something?
>
> The percpu register (in the above example %r4) first contains the base address
> of a percpu variable. To get the actual percpu address of the variable the
> percpu_offset of the corresponding cpu has to be added to that address, which
> is what the AG instruction is doing.
I knew my brain was fading.
I'm sure it should be possible to get the linker to put the offset of
the variable from the base on the per-cpu data area into the laag instruction.
(Looks like it has a 20bit offset field.)
Although I've no idea how per-cpu data works in loadable modules.
That might mean you need lc->percpu_address as well as lc->percpu_offset.
(If it isn't there already.)
-- David
>
> What you propose would make a CPU's percpu_offset the address of any percpu
> variable, which most likely would result in a crash.
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2026-06-02 14:36 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-05-26 5:56 [PATCH v5 0/7] s390: Improve this_cpu operations Heiko Carstens
2026-05-26 5:56 ` [PATCH v5 1/7] s390/percpu: Infrastructure for more efficient " Heiko Carstens
2026-06-01 7:10 ` Alexander Gordeev
2026-06-01 13:27 ` Heiko Carstens
2026-06-01 14:49 ` Alexander Gordeev
2026-06-01 15:08 ` Heiko Carstens
2026-06-02 13:32 ` David Laight
2026-06-02 13:54 ` Heiko Carstens
2026-06-02 14:36 ` David Laight
2026-06-02 10:08 ` Alexander Gordeev
2026-05-26 5:56 ` [PATCH v5 2/7] s390/percpu: Add missing do { } while (0) constructs Heiko Carstens
2026-06-01 7:14 ` Alexander Gordeev
2026-05-26 5:56 ` [PATCH v5 3/7] s390/percpu: Use new percpu code section for arch_this_cpu_add() Heiko Carstens
2026-06-02 11:33 ` Alexander Gordeev
2026-05-26 5:56 ` [PATCH v5 4/7] s390/percpu: Use new percpu code section for arch_this_cpu_add_return() Heiko Carstens
2026-06-02 11:34 ` Alexander Gordeev
2026-05-26 5:57 ` [PATCH v5 5/7] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]() Heiko Carstens
2026-06-02 11:35 ` Alexander Gordeev
2026-05-26 5:57 ` [PATCH v5 6/7] s390/percpu: Provide arch_this_cpu_read() implementation Heiko Carstens
2026-06-02 11:40 ` Alexander Gordeev
2026-05-26 5:57 ` [PATCH v5 7/7] s390/percpu: Provide arch_this_cpu_write() implementation Heiko Carstens
2026-06-02 11:40 ` Alexander Gordeev
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox