Re: [PATCH v3 0/9] s390: Improve this_cpu operations

From: David Laight <david.laight.linux@gmail.com>
To: Heiko Carstens <hca@linux.ibm.com>
Cc: Yang Shi <yang@os.amperecomputing.com>,
	Alexander Gordeev <agordeev@linux.ibm.com>,
	Sven Schnelle <svens@linux.ibm.com>,
	Vasily Gorbik <gor@linux.ibm.com>,
	Christian Borntraeger <borntraeger@linux.ibm.com>,
	Juergen Christ <jchrist@linux.ibm.com>,
	"Christoph Lameter (Ampere)" <cl@gentwo.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Shrikanth Hegde <sshegde@linux.ibm.com>,
	linux-kernel@vger.kernel.org, linux-s390@vger.kernel.org
Subject: Re: [PATCH v3 0/9] s390: Improve this_cpu operations
Date: Thu, 28 May 2026 18:14:45 +0100
Message-ID: <20260528181445.79859403@pumpkin>
In-Reply-To: <20260528141441.15387D07-hca@linux.ibm.com>

On Thu, 28 May 2026 16:14:41 +0200
Heiko Carstens <hca@linux.ibm.com> wrote:

> On Wed, May 27, 2026 at 04:44:31PM -0700, Yang Shi wrote:
> > On 5/22/26 2:18 AM, Heiko Carstens wrote:  
> > > It is amazing to see the performance improvements you see on arm64, however
> > > I believe that is mainly because of the large amount of code which is
> > > generated by the arm64 implementations of the preempt primitives
> > > __preempt_count_add() and __preempt_count_dec_and_test().  
> > 
> > Yes, we need 4 instructions on ARM64 for disabling/enabling preempt (one
> > instruction is used to load current pointer, the other 3 instructions are
> > used to RMW preempt_count). So I can remove 8 instructions in total for a
> > single this_cpu ops. That's a lot. Given this_cpu ops are heavily used in
> > kernel, we end up running fewer instructions and having better icache hit
> > rate, the better icache hit rate also helps reduce cross node traffic for
> > 2-socket system.  
> 
> You save more. Look at arm64's __preempt_count_dec_and_test()
> implementation: it is RMW + compare + READ + compare.
> 
> preempt_enable() generates this code, where x1 seems to contain the
> preempt_count pointer:
> 
>   80:   f9400420        ldr     x0, [x1, #8]
>   84:   d1000400        sub     x0, x0, #0x1
>   88:   b9000820        str     w0, [x1, #8]
>   8c:   b4000060        cbz     x0, 98 <bar+0x58>
>   90:   f9400420        ldr     x0, [x1, #8]
>   94:   b5000040        cbnz    x0, 9c <bar+0x5c>
>   98:   94000000        bl      0 <preempt_schedule_notrace>
>   9c:   ...
> 
> I assume arm64's instruction set does not allow for better code for
> __preempt_count_dec_and_test() if you would fold the need_resched bit into
> preempt_count and use atomic instructions + inline assembly with flag
> output operands when modifying preempt_count.
> As of now only x86 and s390 are doing that.

I think arm64 only has single instruction exchanges - which makes life hard.
But it has to be possible to do better than the above.
The 'normal' path (not nested, no preemption) seems to execute everything
except the 'bl'.
All the 'not preempted' paths have a taken forwards conditional branch
that stands a fair chance of being mispredicted.
There is also the 32bit write followed by a 64bit read of the same address.
That will 'break' any logic that does 'store to load' forwarding (where
the read is satisfied from the store buffer) and add more delays.
That means I think you need something like:
	ldr	w0, [x1, #8]
	sub	x0, x0, #1
	str	w0, [x1, #8]
	ldr	w2, [x1, #12]
	orr	x0, x0, x2
	cbz	x0, 1f
2:
# sometime later.
1:
	bl	preempt_schedule
	b	2b

But the last ARM system I wrote asm for was a StrongARM!
And the book I have is from 2004.
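
For reference, the intent of that sequence can be modelled in plain C.
This is only a user-space sketch, not the kernel's code: the struct and
function names are made up for illustration, and it assumes (as the quoted
disassembly suggests) that the second 32-bit word is zero when a
reschedule is pending.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two 32-bit words at [x1, #8] and [x1, #12]. */
struct preempt_model {
	uint32_t count;		/* nesting depth */
	uint32_t need_resched;	/* assumed inverted: 0 means a reschedule is pending */
};

/* Returns true when the caller should take the out-of-line call. */
static inline bool model_dec_and_test(volatile struct preempt_model *p)
{
	uint32_t count = p->count - 1;	/* 32-bit load, then decrement */

	p->count = count;		/* 32-bit store, same width as the load */
	/* Separate 32-bit load, OR-ed in so one zero test covers both words. */
	return (count | p->need_resched) == 0;
}
```

Both accesses to the count are 32 bits wide, so there is no narrow store
followed by a wider load of the same address, and only one conditional
branch is needed on the result.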

The definition:
#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
} while (0) 
doesn't really help.
gcc tends to ignore the unlikely() when the other path is empty
and just generates a forwards branch around the call.
Forcing it to generate both parts of the if can help.
So:
#define preempt_enable() \
do { \
	barrier(); \
	if (unlikely(preempt_count_dec_and_test())) \
		__preempt_schedule(); \
	else \
		asm (""); \
} while (0)
can be enough to force a conditional branch to the call.
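
A stand-alone version of the same trick, for trying out on a compiler of
choice. The names here (model_count, model_slow_path, model_enable) are
hypothetical and only mimic the shape of the macro; whether the generated
code actually changes depends on the compiler version and options.

```c
/* GCC/Clang builtin; the kernel's unlikely() is defined along these lines. */
#define unlikely(x) __builtin_expect(!!(x), 0)

static int model_count;		/* hypothetical stand-in for preempt_count */
static int model_slow_calls;	/* counts how often the slow path ran */

static void model_slow_path(void)
{
	model_slow_calls++;
}

/*
 * The empty asm in the else branch gives the compiler a non-empty
 * "likely" path. Spelled __asm__ so it also builds in strict ISO modes.
 */
#define model_enable() \
do { \
	if (unlikely(--model_count == 0)) \
		model_slow_path(); \
	else \
		__asm__ (""); \
} while (0)
```

Comparing the objdump -d output with and without the else branch should
show whether the call really ends up out of line.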

-- David
