All of lore.kernel.org
 help / color / mirror / Atom feed
From: Heiko Carstens <hca@linux.ibm.com>
To: Yang Shi <yang@os.amperecomputing.com>
Cc: David Laight <david.laight.linux@gmail.com>,
	Alexander Gordeev <agordeev@linux.ibm.com>,
	Sven Schnelle <svens@linux.ibm.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Juergen Christ <jchrist@linux.ibm.com>,
	"Christoph Lameter (Ampere)" <cl@gentwo.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	linux-kernel@vger.kernel.org, linux-s390@vger.kernel.org
Subject: Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Date: Thu, 21 May 2026 12:37:42 +0200	[thread overview]
Message-ID: <20260521103742.9603C8c-hca@linux.ibm.com> (raw)
In-Reply-To: <d8e61923-2e0b-422c-b2f6-5ccedf3852bb@os.amperecomputing.com>

On Wed, May 20, 2026 at 05:23:37PM -0700, Yang Shi wrote:
> > > If I understand correctly, you replaced preempt_disable() and
> > > preempt_enable() with seq begin and seg end, and seq begin and seq end
> > > can be optimized by mvyi instruction on S390. So you just need a single
> > > mvyi instruction for each instead of read-modify-write the seq count.
> > > 
> > > But you need some extra overhead for context switch (save and restore
> > > the seq count register) and need to check whether it is still on the
> > > same cpu once resuming execution. And there is also penalty if it is
> > > migrated to another CPU (need to rerun this_cpu ops).
> > Not as I understand it.
> > What happens is the context switch code 'corrupts' the register being
> > used to access per-cpu data so that it is correct for the new cpu.
> > The write of zero after the sequence is there to stop the register
> > being corrupted outside of this code window.
> 
> Thanks for elaborating it. I misunderstood some nuance. I read the patch #2
> commit message, now I think I understand how it works.

As background: s390 has so called prefix pages; the first two pages of every
CPU are percpu, via a special prefixing mechanism. Parts of the pages can be
used by operating systems as percpu data area, which we use to have fast
access to e.g. the 'current' pointer, the pid, percpu_offset of the current
cpu, etc.

Helpful is also that for instructions which access memory with a base register
zero, its contents are assumed to be zero for address generation by the
hardware, regardless of its real contents. That is, the above

        ag %r4,952

is the short version of

        ag %r4,952(%r0)

The eight bytes at offset 952 of the current CPU's prefix page are added to
register 4. Real contents of register 0 are irrelevant for such address
generations; reducing register pressure.

> Borrowed the disassemble from patch #2 commit message:
> 
>   11a8e6:       c0 30 00 d0 c5 0d       larl    %r3,1b33300
>   11a8ec:       b9 04 00 43             lgr     %r4,%r3
>   11a8f0:       eb 00 43 c0 00 52       mviy    960,4
>   11a8f6:       e3 40 03 b8 00 08       ag      %r4,952
>   11a8fc:       eb 52 40 00 00 e8       laag    %r5,%r2,0(%r4)
>   11a902:       eb 00 03 c0 00 52       mviy    960,0
>   11a908:       b9 08 00 25             agr     %r2,%r5
>   11a90c        07 fe                   br      %r14
> 
> 11a8f0 loads the percpu offset and mark the percpu code section begin, I
> believe this is needed with percpu page table too because we need load local
> percpu offset.

No, 11a8f0 _writes_ the base register number, which contains the percpu
address used by the percpu atomic op at 11a8fc, to offset 960 of the first
prefix page. It could also be written like

	mviy 960(%r0),4

maybe that makes it more obvious what happens. And yes, this marks the
beginning of a percpu code section. The percpu offset is added to register r4
at 11a8f6 with the ag instruction. This could also be written like

	ag %r4,952(%r0)

This reads the eight byte percpu_offset from offset 952 of the first prefix
page, and adds it to register r4.

> 11a920 loads 0 to the register to mark the percpu code section end, this is
> not needed with percpu page table.

I guess you meant 11a902. But yes, this marks the end of the percpu code
section. Just that this is not a register, but a memory location where is
written to.

> And you need to save the register at the irq/exception entry, then restore
> it at exit. But you also need to check whether migration happens or not, if
> it happens kernel needs to rewrite the register with correct percpu offset
> and needs to check whether the interrupted instruction is "ag".

Yes.

> If it is "ag" instruction (11a8f6) , kernel needs to recalculate the percpu
> address, right?

No, if it is within the percpu code section and it is _not_ the ag instruction,
the percpu base register needs to be adjusted (that's by the way a bug in
patch two, which has this logic inverted - my mistake).

> It sounds a little bit hacky to me TBH and incur some extra overhead for
> "migration detection" and fixup.

Sure, it is hacky, and the small overhead part is of course true.

Compared to the percpu page table proposal the two mviy instructions above
would go away, as well as the extra interrupt/exception overhead. Besides
that your proposal is way less hacky.

> > > So it seems have more overhead than the percpu page table approach IIUC.
> > > We don't need all the steps with percpu page table. And there is no
> > > penalty for migration.
> > This code looks like it relies on 'page zero' already being percpu.
> > So it probably isn't really that different.
> > Some values like the 'preemption disable count' and 'current' could be
> > (maybe are?) written into page zero to give fast access.
> 
> I don't quite get what you mean about 'page zero'.

Hopefully the above description with prefix pages explains it?

  parent reply	other threads:[~2026-05-21 10:37 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-20  9:22 [PATCH v3 0/9] s390: Improve this_cpu operations Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 1/9] s390/alternatives: Add new ALT_TYPE_PERCPU type Heiko Carstens
2026-05-20 12:43   ` David Laight
2026-05-20 13:50     ` Heiko Carstens
2026-05-20 14:16       ` Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 2/9] s390/percpu: Infrastructure for more efficient this_cpu operations Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 3/9] s390/percpu: Add missing do { } while (0) constructs Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 4/9] s390/percpu: Use new percpu code section for arch_this_cpu_add() Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 5/9] s390/percpu: Use new percpu code section for arch_this_cpu_add_return() Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 6/9] s390/percpu: Use new percpu code section for arch_this_cpu_[and|or]() Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 7/9] s390/percpu: Provide arch_this_cpu_read() implementation Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 8/9] s390/percpu: Provide arch_this_cpu_write() implementation Heiko Carstens
2026-05-20  9:22 ` [PATCH v3 9/9] s390/percpu: Remove one and two byte this_cpu operation implementation Heiko Carstens
2026-05-20 18:42 ` [PATCH v3 0/9] s390: Improve this_cpu operations Yang Shi
2026-05-20 22:34   ` David Laight
2026-05-21  0:23     ` Yang Shi
2026-05-21 10:17       ` David Laight
2026-05-21 16:57         ` Yang Shi
2026-05-21 17:55           ` David Laight
2026-05-21 20:46             ` Yang Shi
2026-05-21 22:13               ` David Laight
2026-05-21 23:41                 ` Yang Shi
2026-05-21 10:23       ` David Laight
2026-05-21 17:48         ` Yang Shi
2026-05-21 10:37       ` Heiko Carstens [this message]
2026-05-21 17:47         ` Yang Shi
2026-05-22  9:18           ` Heiko Carstens
2026-05-27 19:09             ` Christoph Lameter (Ampere)
2026-05-27 20:38               ` Yang Shi
2026-05-28  8:36                 ` David Laight
2026-05-27 23:44             ` Yang Shi
2026-05-28  9:03               ` David Laight
2026-05-28 19:19                 ` Yang Shi
2026-05-28 20:34                   ` David Laight
2026-05-28 14:14               ` Heiko Carstens
2026-05-28 17:14                 ` David Laight
2026-05-28 18:39                 ` Yang Shi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260521103742.9603C8c-hca@linux.ibm.com \
    --to=hca@linux.ibm.com \
    --cc=agordeev@linux.ibm.com \
    --cc=borntraeger@linux.ibm.com \
    --cc=cl@gentwo.org \
    --cc=david.laight.linux@gmail.com \
    --cc=gor@linux.ibm.com \
    --cc=jchrist@linux.ibm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-s390@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=sshegde@linux.ibm.com \
    --cc=svens@linux.ibm.com \
    --cc=yang@os.amperecomputing.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.