All of lore.kernel.org
 help / color / mirror / Atom feed
From: Heiko Carstens <hca@linux.ibm.com>
To: Yang Shi <yang@os.amperecomputing.com>
Cc: David Laight <david.laight.linux@gmail.com>,
	Alexander Gordeev <agordeev@linux.ibm.com>,
	Sven Schnelle <svens@linux.ibm.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Juergen Christ <jchrist@linux.ibm.com>,
	"Christoph Lameter (Ampere)" <cl@gentwo.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	linux-kernel@vger.kernel.org, linux-s390@vger.kernel.org
Subject: Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Date: Fri, 22 May 2026 11:18:05 +0200	[thread overview]
Message-ID: <20260522091805.18098A5c-hca@linux.ibm.com> (raw)
In-Reply-To: <5158d4e8-19a7-4f60-b2fd-bc6bab22baf0@os.amperecomputing.com>

On Thu, May 21, 2026 at 10:47:49AM -0700, Yang Shi wrote:
> > As background: s390 has so called prefix pages; the first two pages of every
> > CPU are percpu, via a special prefixing mechanism. Parts of the pages can be
> > used by operating systems as percpu data area, which we use to have fast
> > access to e.g. the 'current' pointer, the pid, percpu_offset of the current
> > cpu, etc.
> > 
> > Helpful is also that for instructions which access memory with a base register
> > zero, its contents are assumed to be zero for address generation by the
> > hardware, regardless of its real contents. That is, the above
> > 
> >          ag %r4,952
> > 
> > is the short version of
> > 
> >          ag %r4,952(%r0)
> > 
> > The eight bytes at offset 952 of the current CPU's prefix page are added to
> > register 4. Real contents of register 0 are irrelevant for such address
> > generations; reducing register pressure.
> 
> Aha, I see. So the prefix pages are some special memory?

No, it is regular memory. The CPU has a special "prefix register". If
that is set to an address not equal to zero all memory accesses to the
first two pages will be transparently redirected to the 8k memory area
specified with that register.

E.g. the prefix register contains the value 0x10000. If then a memory
access to address 0x400 happens the CPU will transparently turn that
into a memory access to address 0x10400. Or in other words, that is a
small per cpu memory area mechanism provided by the architecture.

> > >    11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
> > >    11a8ec:       b9 04 00 43             lgr     %r4,%r3
> > >    11a8f0:       eb 00 43 c0 00 52       mviy    960,4
> > >    11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
> > >    11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
> > >    11a902:       eb 00 03 c0 00 52       mviy    960,0
> > >    11a908:       b9 08 00 25             agr     %r2,%r5
> > >    11a90c        07 fe                   br      %r14

...

> > > 11a920 loads 0 to the register to mark the percpu code section end, this is
> > > not needed with percpu page table.
> > I guess you meant 11a902. But yes, this marks the end of the percpu code
> > section. Just that this is not a register, but a memory location where is
> > written to.
> 
> So both mviy instructions actually do memory store?

Yes.

> > > It sounds a little bit hacky to me TBH and incur some extra overhead for
> > > "migration detection" and fixup.
> > Sure, it is hacky, and the small overhead part is of course true.
> > 
> > Compared to the percpu page table proposal the two mviy instructions above
> > would go away, as well as the extra interrupt/exception overhead. Besides
> > that your proposal is way less hacky.
> 
> It would be great if we can compare the performance number for the two
> approaches. rseq has been discussed for ARM64, but it seems too expensive
> and just move the overhead to somewhere else.

I tried to implement the proposed rseq/kseq, but the required inline
assemblies resulted in code which was larger than what we have now for
s390.

Also with the current proposal I only did some quick micro benchmarks,
which resulted in 0-1% improvement, which is in the expected range.

It is amazing to see the performance improvements you see on arm64, however
I believe that is mainly because of the large amount of code which is
generated by the arm64 implementations of the preempt primitives
__preempt_count_add() and __preempt_count_dec_and_test().

That's a big difference to s390: for both primitives the result is a single
instruction.

  reply	other threads:[~2026-05-22  9:18 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-20  9:22 [PATCH v3 0/9] s390: Improve this_cpu operations Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 1/9] s390/alternatives: Add new ALT_TYPE_PERCPU type Heiko Carstens
2026-05-20 12:43   ` David Laight
2026-05-20 13:50     ` Heiko Carstens
2026-05-20 14:16       ` Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 2/9] s390/percpu: Infrastructure for more efficient this_cpu operations Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 3/9] s390/percpu: Add missing do { } while (0) constructs Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 4/9] s390/percpu: Use new percpu code section for arch_this_cpu_add() Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 5/9] s390/percpu: Use new percpu code section for arch_this_cpu_add_return() Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 6/9] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]() Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 7/9] s390/percpu: Provide arch_this_cpu_read() implementation Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 8/9] s390/percpu: Provide arch_this_cpu_write() implementation Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 9/9] s390/percpu: Remove one and two byte this_cpu operation implementation Heiko Carstens
2026-05-20 18:42 ` [PATCH v3 0/9] s390: Improve this_cpu operations Yang Shi
2026-05-20 22:34   ` David Laight
2026-05-21  0:23     ` Yang Shi
2026-05-21 10:17       ` David Laight
2026-05-21 16:57         ` Yang Shi
2026-05-21 17:55           ` David Laight
2026-05-21 20:46             ` Yang Shi
2026-05-21 22:13               ` David Laight
2026-05-21 23:41                 ` Yang Shi
2026-05-21 10:23       ` David Laight
2026-05-21 17:48         ` Yang Shi
2026-05-21 10:37       ` Heiko Carstens
2026-05-21 17:47         ` Yang Shi
2026-05-22  9:18           ` Heiko Carstens [this message]
2026-05-27 19:09             ` Christoph Lameter (Ampere)
2026-05-27 20:38               ` Yang Shi
2026-05-28  8:36                 ` David Laight
2026-05-27 23:44             ` Yang Shi
2026-05-28  9:03               ` David Laight
2026-05-28 19:19                 ` Yang Shi
2026-05-28 20:34                   ` David Laight
2026-05-28 14:14               ` Heiko Carstens
2026-05-28 17:14                 ` David Laight
2026-05-28 18:39                 ` Yang Shi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260522091805.18098A5c-hca@linux.ibm.com \
    --to=hca@linux.ibm.com \
    --cc=agordeev@linux.ibm.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=cl@gentwo.org \
    --cc=david.laight.linux@gmail.com \
    --cc=gor@linux.ibm.com \
    --cc=jchrist@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=sshegde@linux.ibm.com \
    --cc=svens@linux.ibm.com \
    --cc=yang@os.amperecomputing.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.