* [PATCH 1/2] powerpc/locking: implement this_cpu_cmpxchg local API
@ 2023-12-04 2:23 Luming Yu
2023-12-11 11:40 ` Michael Ellerman
From: Luming Yu @ 2023-12-04 2:23 UTC
To: linuxppc-dev, linux-kernel, mpe, npiggin, christophe.leroy
Cc: shenghui.qu, Luming Yu, dawei.li, ke.zhao, luming.yu
ppc already appears to support the cmpxchg-local atomic semantics
defined by the kernel convention for this feature.
Add this_cpu_cmpxchg ppc-local variants so the arch-specific
implementation can outperform the asm-generic C version of the locking API.
Signed-off-by: Luming Yu <luming.yu@shingroup.cn>
---
arch/powerpc/include/asm/percpu.h | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h
index 8e5b7d0b851c..ceab5df6e7ab 100644
--- a/arch/powerpc/include/asm/percpu.h
+++ b/arch/powerpc/include/asm/percpu.h
@@ -18,5 +18,22 @@
 #include <asm-generic/percpu.h>
 
 #include <asm/paca.h>
+#include <asm/cmpxchg.h>
+#ifdef this_cpu_cmpxchg_1
+#undef this_cpu_cmpxchg_1
+#define this_cpu_cmpxchg_1(pcp, oval, nval) __cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 1)
+#endif
+#ifdef this_cpu_cmpxchg_2
+#undef this_cpu_cmpxchg_2
+#define this_cpu_cmpxchg_2(pcp, oval, nval) __cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 2)
+#endif
+#ifdef this_cpu_cmpxchg_4
+#undef this_cpu_cmpxchg_4
+#define this_cpu_cmpxchg_4(pcp, oval, nval) __cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 4)
+#endif
+#ifdef this_cpu_cmpxchg_8
+#undef this_cpu_cmpxchg_8
+#define this_cpu_cmpxchg_8(pcp, oval, nval) __cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 8)
+#endif
 
 #endif /* _ASM_POWERPC_PERCPU_H_ */
--
2.42.0.windows.2
* Re: [PATCH 1/2] powerpc/locking: implement this_cpu_cmpxchg local API
2023-12-04 2:23 [PATCH 1/2] powerpc/locking: implement this_cpu_cmpxchg local API Luming Yu
@ 2023-12-11 11:40 ` Michael Ellerman
2023-12-15 8:50 ` Luming Yu
From: Michael Ellerman @ 2023-12-11 11:40 UTC
To: Luming Yu, linuxppc-dev, linux-kernel, npiggin, christophe.leroy
Cc: shenghui.qu, Luming Yu, dawei.li, ke.zhao, luming.yu
Hi Luming Yu,
Luming Yu <luming.yu@shingroup.cn> writes:
> ppc already appears to support the cmpxchg-local atomic semantics
> defined by the kernel convention for this feature.
> Add this_cpu_cmpxchg ppc-local variants so the arch-specific
> implementation can outperform the asm-generic C version of the locking API.
Implementing this has been suggested before but it wasn't clear that it
was actually going to perform better than the generic version.
On 64-bit we have interrupt soft masking, so disabling interrupts is
relatively cheap. So the generic this_cpu_cmpxchg using irq disable just
becomes a few loads & stores, no atomic ops required.
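For reference, the generic fallback is roughly the following (paraphrased
from include/asm-generic/percpu.h, not a verbatim copy):

	/* plain C cmpxchg on the per-cpu slot, no larx/stcx. needed */
	#define raw_cpu_generic_cmpxchg(pcp, oval, nval)		\
	({								\
		typeof(pcp) *__p = raw_cpu_ptr(&(pcp));			\
		typeof(pcp) __ret = *__p;				\
		if (__ret == (oval))					\
			*__p = (nval);					\
		__ret;							\
	})

	/* irq disable is only soft masking on 64-bit, so this is cheap */
	#define this_cpu_generic_cmpxchg(pcp, oval, nval)		\
	({								\
		typeof(pcp) __ret;					\
		unsigned long __flags;					\
		raw_local_irq_save(__flags);				\
		__ret = raw_cpu_generic_cmpxchg(pcp, oval, nval);	\
		raw_local_irq_restore(__flags);				\
		__ret;							\
	})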
In contrast implementing it using __cmpxchg_local() will use
ldarx/stdcx etc. which will be more expensive.
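The arch path, by contrast, boils down to a larx/stcx. retry loop,
something like this sketch of the 64-bit case (illustrative only, not the
exact code in asm/cmpxchg.h):

static inline unsigned long
cmpxchg_u64_local_sketch(volatile unsigned long *p, unsigned long old,
			 unsigned long new)
{
	unsigned long prev;

	__asm__ __volatile__(
"1:	ldarx	%0,0,%2\n"	/* load-reserve the current value */
"	cmpd	0,%0,%3\n"	/* no store if it doesn't match 'old' */
"	bne-	2f\n"
"	stdcx.	%4,0,%2\n"	/* conditionally store the new value */
"	bne-	1b\n"		/* retry if the reservation was lost */
"2:"
	: "=&r" (prev), "+m" (*p)
	: "r" (p), "r" (old), "r" (new)
	: "cc", "memory");

	return prev;
}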
Have you done any performance measurements?
It probably is a bit fishy that we don't mask PMU interrupts vs
this_cpu_cmpxchg() etc., but I don't think it's actually a bug given the
few places using this_cpu_cmpxchg(). Though I could be wrong about that.
> diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h
> index 8e5b7d0b851c..ceab5df6e7ab 100644
> --- a/arch/powerpc/include/asm/percpu.h
> +++ b/arch/powerpc/include/asm/percpu.h
> @@ -18,5 +18,22 @@
> #include <asm-generic/percpu.h>
>
> #include <asm/paca.h>
> +#include <asm/cmpxchg.h>
> +#ifdef this_cpu_cmpxchg_1
> +#undef this_cpu_cmpxchg_1
If we need to undef then I think something has gone wrong with the
header inclusion order, the model should be that the arch defines what
it has and the generic code provides fallbacks if the arch didn't define
anything.
> +#define this_cpu_cmpxchg_1(pcp, oval, nval) __cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 1)
I think this is unsafe vs preemption. The raw_cpu_ptr() can generate the
per-cpu address, but then the task can be preempted and moved to a
different CPU before the ldarx/stdcx do the cmpxchg.
The arm64 implementation had the same bug until they fixed it.
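To be safe against that, the macros would need to pin the task across both
the address generation and the cmpxchg itself, conceptually something like
this (illustrative sketch only, untested; arm64's actual fix uses its own
protect helpers):

#define this_cpu_cmpxchg_8(pcp, oval, nval)				\
({									\
	typeof(pcp) __ret;						\
	/* stay on one CPU from raw_cpu_ptr() through the stdcx. */	\
	preempt_disable_notrace();					\
	__ret = (typeof(pcp))__cmpxchg_local(raw_cpu_ptr(&(pcp)),	\
					     oval, nval, 8);		\
	preempt_enable_notrace();					\
	__ret;								\
})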
> +#endif
> +#ifdef this_cpu_cmpxchg_2
> +#undef this_cpu_cmpxchg_2
> +#define this_cpu_cmpxchg_2(pcp, oval, nval) __cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 2)
> +#endif
> +#ifdef this_cpu_cmpxchg_4
> +#undef this_cpu_cmpxchg_4
> +#define this_cpu_cmpxchg_4(pcp, oval, nval) __cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 4)
> +#endif
> +#ifdef this_cpu_cmpxchg_8
> +#undef this_cpu_cmpxchg_8
> +#define this_cpu_cmpxchg_8(pcp, oval, nval) __cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 8)
> +#endif
>
> #endif /* _ASM_POWERPC_PERCPU_H_ */
cheers
* Re: [PATCH 1/2] powerpc/locking: implement this_cpu_cmpxchg local API
2023-12-11 11:40 ` Michael Ellerman
@ 2023-12-15 8:50 ` Luming Yu
2024-08-15 9:04 ` 虞陆铭
From: Luming Yu @ 2023-12-15 8:50 UTC
To: Michael Ellerman
Cc: ke.zhao, dawei.li, npiggin, linux-kernel, luming.yu, shenghui.qu,
linuxppc-dev
On Mon, Dec 11, 2023 at 10:40:38PM +1100, Michael Ellerman wrote:
> Hi Luming Yu,
>
> Luming Yu <luming.yu@shingroup.cn> writes:
> > ppc already appears to support the cmpxchg-local atomic semantics
> > defined by the kernel convention for this feature.
> > Add this_cpu_cmpxchg ppc-local variants so the arch-specific
> > implementation can outperform the asm-generic C version of the locking API.
>
> Implementing this has been suggested before but it wasn't clear that it
> was actually going to perform better than the generic version.
Thanks for the info, that is news to me. :-)
I will see whether a web search can turn up that earlier discussion.
>
> On 64-bit we have interrupt soft masking, so disabling interrupts is
> relatively cheap. So the generic this_cpu_cmpxchg using irq disable just
> becomes a few loads & stores, no atomic ops required.
Something like this just popped up in my first try with a p8 test KVM
guest on a c1000 powernv8 platform.
I'm not sure whether the soft lockup is related to the interrupt soft
masking, but I will find out for sure. :-)
[ 460.217669] watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
[ 460.217742] Modules linked in:
[ 460.217828] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W L N 6.7.0-rc5+ #5
[ 460.217915] Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1200 0xf000005 of:SLOF,git-6b6c16 pSeries
[ 460.217999] NIP: c00000000003e0ec LR: c00000000003e414 CTR: 0000000000000000
[ 460.218074] REGS: c000000004797788 TRAP: 0900 Tainted: G W L N (6.7.0-rc5+)
[ 460.218151] MSR: 8000000002009033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 24042442 XER: 00000000
[ 460.218342] CFAR: 0000000000000000 IRQMASK: 0
[ 460.218342] GPR00: c00000000003e414 c000000004797760 c000000001583b00 c000000004797758
[ 460.218342] GPR04: 0000000000000000 0000000000000004 c000000004ccf51c c00000000224e510
[ 460.218342] GPR08: 4000000000000002 0000000000000049 c00000000457b280 0015000b00170038
[ 460.218342] GPR12: 00340030003a0010 c000000002f40000 0000000000000008 c000000004ccf4fc
[ 460.218342] GPR16: 0000000000007586 c0000000040f45c0 c000000004ddd080 c0000000040f45c0
[ 460.218342] GPR20: 0000000000000008 0000000000000024 0000000000000004 c000000004ccf4fc
[ 460.218342] GPR24: 000000000000003f 0000000000000001 0000000000000001 c000000004cc6e7e
[ 460.218342] GPR28: fcffffffffffffff 0000000000000002 fcffffffffffffff 0000000000000003
[ 460.219187] NIP [c00000000003e0ec] __replay_soft_interrupts+0x3c/0x160
[ 460.219286] LR [c00000000003e414] arch_local_irq_restore+0x174/0x1a0
[ 460.219365] Call Trace:
[ 460.219400] [c000000004797760] [c00000000003e150] __replay_soft_interrupts+0xa0/0x160 (unreliable)
[ 460.219515] [c000000004797910] [c00000000003e414] arch_local_irq_restore+0x174/0x1a0
[ 460.219612] [c000000004797950] [c000000000a155c4] get_random_u32+0xb4/0x140
[ 460.219699] [c0000000047979a0] [c0000000008b1ae0] get_rcw_we+0x138/0x274
[ 460.219781] [c000000004797a30] [c00000000208d4bc] test_rslib_init+0x8b8/0xb70
[ 460.219877] [c000000004797c40] [c00000000000fb80] do_one_initcall+0x60/0x390
[ 460.219965] [c000000004797d10] [c000000002004a18] kernel_init_freeable+0x32c/0x3dc
[ 460.220059] [c000000004797de0] [c0000000000102a4] kernel_init+0x34/0x1b0
[ 460.220141] [c000000004797e50] [c00000000000cf14] ret_from_kernel_user_thread+0x14/0x1c
[ 460.220229] --- interrupt: 0 at 0x0
[ 460.220291] Code: 60000000 7c0802a6 f8010010 f821fe51 e92d0c78 f92101a8 39200000 38610028 892d0933 61290040 992d0933 48043a3d <60000000> 39200000 e9410130 f9210160
[ 460.955369] Testing (23,15)_64 code...
>
> In contrast implementing it using __cmpxchg_local() will use
> ldarx/stdcx etc. which will be more expensive.
>
> Have you done any performance measurements?
Yes, I'm looking for resources to measure the perf change (positive or
negative) in this area now that this patch set is being proposed again.
>
> It probably is a bit fishy that we don't mask PMU interrupts vs
> this_cpu_cmpxchg() etc., but I don't think it's actually a bug given the
> few places using this_cpu_cmpxchg(). Though I could be wrong about that.
I will try to understand the concern and come up with a patch update,
provided the performance numbers from the change look reasonable and promising.
>
> > diff --git a/arch/powerpc/include/asm/percpu.h b/arch/powerpc/include/asm/percpu.h
> > index 8e5b7d0b851c..ceab5df6e7ab 100644
> > --- a/arch/powerpc/include/asm/percpu.h
> > +++ b/arch/powerpc/include/asm/percpu.h
> > @@ -18,5 +18,22 @@
> > #include <asm-generic/percpu.h>
> >
> > #include <asm/paca.h>
> > +#include <asm/cmpxchg.h>
> > +#ifdef this_cpu_cmpxchg_1
> > +#undef this_cpu_cmpxchg_1
>
> If we need to undef then I think something has gone wrong with the
> header inclusion order, the model should be that the arch defines what
> it has and the generic code provides fallbacks if the arch didn't define
> anything.
>
> > +#define this_cpu_cmpxchg_1(pcp, oval, nval) __cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 1)
>
> I think this is unsafe vs preemption. The raw_cpu_ptr() can generate the
> per-cpu address, but then the task can be preempted and moved to a
> different CPU before the ldarx/stdcx do the cmpxchg.
>
> The arm64 implementation had the same bug until they fixed it.
Thanks for the review, I will look deeper into this spot. I had assumed that,
for a per-cpu API at this level, it was safe to rely on callers having
preemption disabled. But that assumption may be wrong, and I could well be
mistaken given how rapidly the kernel has been changing. :-(
>
> > +#endif
> > +#ifdef this_cpu_cmpxchg_2
> > +#undef this_cpu_cmpxchg_2
> > +#define this_cpu_cmpxchg_2(pcp, oval, nval) __cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 2)
> > +#endif
> > +#ifdef this_cpu_cmpxchg_4
> > +#undef this_cpu_cmpxchg_4
> > +#define this_cpu_cmpxchg_4(pcp, oval, nval) __cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 4)
> > +#endif
> > +#ifdef this_cpu_cmpxchg_8
> > +#undef this_cpu_cmpxchg_8
> > +#define this_cpu_cmpxchg_8(pcp, oval, nval) __cmpxchg_local(raw_cpu_ptr(&(pcp)), oval, nval, 8)
> > +#endif
> >
> > #endif /* _ASM_POWERPC_PERCPU_H_ */
>
> cheers
>
best regards
* Re: [PATCH 1/2] powerpc/locking: implement this_cpu_cmpxchg local API
2023-12-15 8:50 ` Luming Yu
@ 2024-08-15 9:04 ` 虞陆铭
From: 虞陆铭 @ 2024-08-15 9:04 UTC
To: mpe
Cc: linuxppc-dev, linux-kernel, npiggin, christophe.leroy, luming.yu,
杨佳龙, shenghui.qu
Hi,
I just got some bandwidth to do a perf test with the patch proposal on a
bare-metal p8 powernv platform.
It appears there is a +5% performance gain in UnixBench, which looks encouraging. :-)
System Benchmarks Index Score (baseline):    1
System Benchmarks Index Score (with patch):  1.05
ratio: 1.05x
I know performance is tricky. I will try to allocate more resources for a
deeper dive with a CPI breakdown and other perf tools, to make sure the perf
diff does not come from random variables unrelated to the patch proposal.
Stay tuned.
BR
Luming