llvm.lists.linux.dev archive mirror
* [PATCH] sched/membarrier: Fix redundant load of membarrier_state
@ 2024-10-07  5:39 Nysal Jan K.A.
  2024-10-25  0:29 ` Michael Ellerman
  2024-10-29  5:51 ` [PATCH v2] " Nysal Jan K.A.
  0 siblings, 2 replies; 15+ messages in thread
From: Nysal Jan K.A. @ 2024-10-07  5:39 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linuxppc-dev, Nysal Jan K.A, Nathan Chancellor, Nick Desaulniers,
	Bill Wendling, Justin Stitt, Vlastimil Babka, Kent Overstreet,
	Rick Edgecombe, Roman Gushchin, linux-kernel, llvm

From: "Nysal Jan K.A" <nysal@linux.ibm.com>

On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
is not selected, sync_core_before_usermode() is a no-op.
In membarrier_mm_sync_core_before_usermode() the compiler does not
eliminate redundant branches and the load of mm->membarrier_state
for this case as the atomic_read() cannot be optimized away.

Here's a snippet of the code generated for finish_task_switch() on powerpc:

1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
.......
1b78c8:   cmpdi   cr7,r26,0
1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
1b78d0:   ld      r9,2312(r13)    # current
1b78d4:   ld      r9,1888(r9)     # current->mm
1b78d8:   cmpd    cr7,r26,r9
1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
1b78e0:   hwsync
1b78e4:   cmplwi  cr7,r27,128
.......
1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
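
For reference, the body of membarrier_mm_sync_core_before_usermode()
prior to this change is:

  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
  {
          if (current->mm != mm)
                  return;
          if (likely(!(atomic_read(&mm->membarrier_state) &
                       MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
                  return;
          /* a no-op when ARCH_HAS_SYNC_CORE_BEFORE_USERMODE is not selected */
          sync_core_before_usermode();
  }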

This was found while analyzing "perf c2c" reports on kernels prior
to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
where mm_count was false sharing with membarrier_state.

There is a minor improvement in the size of finish_task_switch().
The following are results from bloat-o-meter:

GCC 7.5.0:
----------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                                     old     new   delta
finish_task_switch                           884     852     -32

GCC 12.2.1:
-----------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                                     old     new   delta
finish_task_switch.isra                      852     820     -32

LLVM 17.0.6:
------------
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
Function                                     old     new   delta
rt_mutex_schedule                            120     104     -16
finish_task_switch                           792     772     -20

Signed-off-by: Nysal Jan K.A <nysal@linux.ibm.com>
---
 include/linux/sched/mm.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 07bb8d4181d7..042e60ab853a 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -540,6 +540,8 @@ enum {
 
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 {
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
+		return;
 	if (current->mm != mm)
 		return;
 	if (likely(!(atomic_read(&mm->membarrier_state) &
-- 
2.35.3


* Re: [PATCH] sched/membarrier: Fix redundant load of membarrier_state
  2024-10-07  5:39 [PATCH] sched/membarrier: Fix redundant load of membarrier_state Nysal Jan K.A.
@ 2024-10-25  0:29 ` Michael Ellerman
  2024-10-25  2:40   ` Stephen Rothwell
                     ` (3 more replies)
  2024-10-29  5:51 ` [PATCH v2] " Nysal Jan K.A.
  1 sibling, 4 replies; 15+ messages in thread
From: Michael Ellerman @ 2024-10-25  0:29 UTC (permalink / raw)
  To: Nysal Jan K.A., Andrew Morton, Mathieu Desnoyers
  Cc: linuxppc-dev, Nysal Jan K.A, Nathan Chancellor, Nick Desaulniers,
	Bill Wendling, Justin Stitt, Vlastimil Babka, Kent Overstreet,
	Rick Edgecombe, Roman Gushchin, linux-kernel, llvm

[To += Mathieu]

"Nysal Jan K.A." <nysal@linux.ibm.com> writes:
> From: "Nysal Jan K.A" <nysal@linux.ibm.com>
>
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and the load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.

I was wondering if this was caused by powerpc's arch_atomic_read() which
uses asm volatile.

But replacing arch_atomic_read() with READ_ONCE() makes no difference,
presumably because the compiler still can't see that the READ_ONCE() is
unnecessary (which is kind of by design).
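
For reference, simplified sketches of the two definitions in question
(the in-tree versions carry a few more annotations than shown here):

  /* powerpc arch_atomic_read(), roughly: the asm volatile makes the load
   * an opaque side effect the compiler must neither drop nor merge. */
  static __inline__ int arch_atomic_read(const atomic_t *v)
  {
          int t;

          __asm__ __volatile__("lwz %0,0(%1)" : "=r"(t) : "b"(&v->counter));
          return t;
  }

  /* READ_ONCE(), roughly: the volatile-qualified access has the same
   * effect, so swapping one for the other changes nothing here. */
  #define READ_ONCE(x)    (*(const volatile typeof(x) *)&(x))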

> Here's a snippet of the code generated for finish_task_switch() on powerpc:
>
> 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
> .......
> 1b78c8:   cmpdi   cr7,r26,0
> 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
> 1b78d0:   ld      r9,2312(r13)    # current
> 1b78d4:   ld      r9,1888(r9)     # current->mm
> 1b78d8:   cmpd    cr7,r26,r9
> 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
> 1b78e0:   hwsync
> 1b78e4:   cmplwi  cr7,r27,128
> .......
> 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
>
> This was found while analyzing "perf c2c" reports on kernels prior
> to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
> where mm_count was false sharing with membarrier_state.

So it was causing a noticeable performance blip? But it isn't anymore?

> There is a minor improvement in the size of finish_task_switch().
> The following are results from bloat-o-meter:
>
> GCC 7.5.0:
> ----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch                           884     852     -32
>
> GCC 12.2.1:
> -----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch.isra                      852     820     -32

GCC 12 is a couple of years old, I assume GCC 14 behaves similarly?

> LLVM 17.0.6:
> ------------
> add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
> Function                                     old     new   delta
> rt_mutex_schedule                            120     104     -16
> finish_task_switch                           792     772     -20
>
> Signed-off-by: Nysal Jan K.A <nysal@linux.ibm.com>
> ---
>  include/linux/sched/mm.h | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 07bb8d4181d7..042e60ab853a 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -540,6 +540,8 @@ enum {
>  
>  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>  {
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> +		return;
>  	if (current->mm != mm)
>  		return;
>  	if (likely(!(atomic_read(&mm->membarrier_state) &

The other option would be to have a completely separate stub, eg:

  #ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
  {
          if (current->mm != mm)
                  return;
          if (likely(!(atomic_read(&mm->membarrier_state) &
                       MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
                  return;
          sync_core_before_usermode();
  }
  #else
  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { }
  #endif

Not sure what folks prefer.

In either case I think it's probably worth a short comment explaining
why it's worth the trouble (ie. that the atomic_read() prevents the
compiler from doing DCE).

cheers

* Re: [PATCH] sched/membarrier: Fix redundant load of membarrier_state
  2024-10-25  0:29 ` Michael Ellerman
@ 2024-10-25  2:40   ` Stephen Rothwell
  2024-10-25  3:22   ` Segher Boessenkool
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 15+ messages in thread
From: Stephen Rothwell @ 2024-10-25  2:40 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nysal Jan K.A., Andrew Morton, Mathieu Desnoyers, linuxppc-dev,
	Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Vlastimil Babka, Kent Overstreet, Rick Edgecombe, Roman Gushchin,
	linux-kernel, llvm

Hi Michael,

On Fri, 25 Oct 2024 11:29:38 +1100 Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > index 07bb8d4181d7..042e60ab853a 100644
> > --- a/include/linux/sched/mm.h
> > +++ b/include/linux/sched/mm.h
> > @@ -540,6 +540,8 @@ enum {
> >  
> >  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
> >  {
> > +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> > +		return;
> >  	if (current->mm != mm)
> >  		return;
> >  	if (likely(!(atomic_read(&mm->membarrier_state) &  
> 
> The other option would be to have a completely separate stub, eg:
> 
>   #ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>   {
>           if (current->mm != mm)
>                   return;
>           if (likely(!(atomic_read(&mm->membarrier_state) &
>                        MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
>                   return;
>           sync_core_before_usermode();
>   }
>   #else
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { }
>   #endif
> 
> Not sure what folks prefer.

In case it matters, in general I prefer the first, as there is less code
to get out of sync and all the code still gets parsed by the compiler
whether the CONFIG option is set or not.
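
A minimal sketch of that property, with a made-up CONFIG_FOO and
do_foo_sync() purely for illustration:

  static inline void foo_sync(void)
  {
          /* IS_ENABLED(CONFIG_FOO) expands to a constant 0 or 1, so the
           * call below is still parsed and type-checked when CONFIG_FOO=n,
           * and then eliminated as dead code. */
          if (!IS_ENABLED(CONFIG_FOO))
                  return;
          do_foo_sync();
  }
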
-- 
Cheers,
Stephen Rothwell


* Re: [PATCH] sched/membarrier: Fix redundant load of membarrier_state
  2024-10-25  0:29 ` Michael Ellerman
  2024-10-25  2:40   ` Stephen Rothwell
@ 2024-10-25  3:22   ` Segher Boessenkool
  2024-10-25 12:40   ` Mathieu Desnoyers
  2024-10-25 18:30   ` Nysal Jan K.A.
  3 siblings, 0 replies; 15+ messages in thread
From: Segher Boessenkool @ 2024-10-25  3:22 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nysal Jan K.A., Andrew Morton, Mathieu Desnoyers, linuxppc-dev,
	Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
	Vlastimil Babka, Kent Overstreet, Rick Edgecombe, Roman Gushchin,
	linux-kernel, llvm

Hi!

On Fri, Oct 25, 2024 at 11:29:38AM +1100, Michael Ellerman wrote:
> [To += Mathieu]
> 
> "Nysal Jan K.A." <nysal@linux.ibm.com> writes:
> > From: "Nysal Jan K.A" <nysal@linux.ibm.com>
> >
> > On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> > is not selected, sync_core_before_usermode() is a no-op.
> > In membarrier_mm_sync_core_before_usermode() the compiler does not
> > eliminate redundant branches and the load of mm->membarrier_state
> > for this case as the atomic_read() cannot be optimized away.
> 
> I was wondering if this was caused by powerpc's arch_atomic_read() which
> uses asm volatile.
> 
> But replacing arch_atomic_read() with READ_ONCE() makes no difference,
> presumably because the compiler still can't see that the READ_ONCE() is
> unnecessary (which is kind of by design).

Exactly.

> > GCC 12.2.1:
> > -----------
> > add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> > Function                                     old     new   delta
> > finish_task_switch.isra                      852     820     -32
> 
> GCC 12 is a couple of years old, I assume GCC 14 behaves similarly?

GCC 12 is still being actively developed.  There will be a 12.5,
probably halfway through next year (and that will be the last 12.x
release, yes).  The GCC homepage (<https://gcc.gnu.org>) will tell you
which releases are still maintained/supported, and sometimes you can
derive our plans from there as well :-)

But yes, 14 is similar (I did not test, but I feel confident making that
assertion :-) )

> >  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
> >  {
> > +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> > +		return;
> >  	if (current->mm != mm)
> >  		return;
> >  	if (likely(!(atomic_read(&mm->membarrier_state) &
> 
> The other option would be to have a completely separate stub, eg:
> 
>   #ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>   {
>           if (current->mm != mm)
>                   return;
>           if (likely(!(atomic_read(&mm->membarrier_state) &
>                        MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
>                   return;
>           sync_core_before_usermode();
>   }
>   #else
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { }
>   #endif
> 
> Not sure what folks prefer.
> 
> In either case I think it's probably worth a short comment explaining
> why it's worth the trouble (ie. that the atomic_read() prevents the
> compiler from doing DCE).

Since you ask, I like the proposed change (the inline one) best.  But
yeah, comment please!

(And it is not about DCE -- the very definition of __READ_ONCE makes it
impossible to CSE any expressions involving it: it (standards-
violatingly) casts the pointers to pointers to volatile, and you cannot
CSE any accesses to volatile objects!)
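
A tiny standalone illustration of that last point:

  /* Both loads must be emitted: an access through a volatile-qualified
   * lvalue is a side effect, so the compiler may not CSE the two reads. */
  int read_twice(const volatile int *p)
  {
          return *p + *p;
  }

  /* Without volatile, the compiler is free to load once and reuse it. */
  int read_twice_maybe_once(const int *p)
  {
          return *p + *p;
  }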

So what are the actual semantics the kernel wants from its READ_ONCE,
and from its atomics in general?  GCC has perfectly fine in-compiler
support for such things, there is no need for making a second rate
manual implementation of parts of this, when you can use a good
implementation of everything instead!


Segher

* Re: [PATCH] sched/membarrier: Fix redundant load of membarrier_state
  2024-10-25  0:29 ` Michael Ellerman
  2024-10-25  2:40   ` Stephen Rothwell
  2024-10-25  3:22   ` Segher Boessenkool
@ 2024-10-25 12:40   ` Mathieu Desnoyers
  2024-10-25 18:30   ` Nysal Jan K.A.
  3 siblings, 0 replies; 15+ messages in thread
From: Mathieu Desnoyers @ 2024-10-25 12:40 UTC (permalink / raw)
  To: Michael Ellerman, Nysal Jan K.A., Andrew Morton
  Cc: linuxppc-dev, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
	Justin Stitt, Vlastimil Babka, Kent Overstreet, Rick Edgecombe,
	Roman Gushchin, linux-kernel, llvm

On 2024-10-24 20:29, Michael Ellerman wrote:
> [To += Mathieu]
> 
> "Nysal Jan K.A." <nysal@linux.ibm.com> writes:
>> From: "Nysal Jan K.A" <nysal@linux.ibm.com>
>>
>> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>> is not selected, sync_core_before_usermode() is a no-op.
>> In membarrier_mm_sync_core_before_usermode() the compiler does not
>> eliminate redundant branches and the load of mm->membarrier_state
>> for this case as the atomic_read() cannot be optimized away.
> 
> I was wondering if this was caused by powerpc's arch_atomic_read() which
> uses asm volatile.
> 
> But replacing arch_atomic_read() with READ_ONCE() makes no difference,
> presumably because the compiler still can't see that the READ_ONCE() is
> unnecessary (which is kind of by design).
> 
>> Here's a snippet of the code generated for finish_task_switch() on powerpc:
>>
>> 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
>> .......
>> 1b78c8:   cmpdi   cr7,r26,0
>> 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
>> 1b78d0:   ld      r9,2312(r13)    # current
>> 1b78d4:   ld      r9,1888(r9)     # current->mm
>> 1b78d8:   cmpd    cr7,r26,r9
>> 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
>> 1b78e0:   hwsync
>> 1b78e4:   cmplwi  cr7,r27,128
>> .......
>> 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
>> 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
>>
>> This was found while analyzing "perf c2c" reports on kernels prior
>> to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
>> where mm_count was false sharing with membarrier_state.
> 
> So it was causing a noticeable performance blip? But it isn't anymore?

I indeed moved mm_count into its own cacheline in response to
performance regression reports, which were caused by simply
loading the pcpu_cid pointer frequently enough. So if membarrier_state
was also sharing that cache line, it makes sense that moving
mm_count away helped there as well.
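
Roughly the situation being described, with illustrative field placement
rather than the real mm_struct layout:

  /* Before c1753fd02a00 (illustrative): mm_count is written on every
   * mmgrab()/mmdrop(), while membarrier_state and pcpu_cid are mostly
   * read on context switch. Sharing a cache line makes those readers
   * take misses whenever mm_count is updated. */
  struct mm_struct_hot_fields {
          atomic_t membarrier_state;          /* read in finish_task_switch() */
          atomic_t mm_count;                  /* written frequently           */
          struct mm_cid __percpu *pcpu_cid;   /* loaded frequently            */
  };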

[...]

>> ---
>>   include/linux/sched/mm.h | 2 ++
>>   1 file changed, 2 insertions(+)
>>
>> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
>> index 07bb8d4181d7..042e60ab853a 100644
>> --- a/include/linux/sched/mm.h
>> +++ b/include/linux/sched/mm.h
>> @@ -540,6 +540,8 @@ enum {
>>   
>>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>>   {
>> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
>> +		return;
>>   	if (current->mm != mm)
>>   		return;
>>   	if (likely(!(atomic_read(&mm->membarrier_state) &

I prefer the approach above, because it requires fewer kernel
configurations to reach the same compilation code coverage.

Thanks,

Mathieu


> 
> The other option would be to have a completely separate stub, eg:
> 
>    #ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>    static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>    {
>            if (current->mm != mm)
>                    return;
>            if (likely(!(atomic_read(&mm->membarrier_state) &
>                         MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
>                    return;
>            sync_core_before_usermode();
>    }
>    #else
>    static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { }
>    #endif
> 
> Not sure what folks prefer.
> 
> In either case I think it's probably worth a short comment explaining
> why it's worth the trouble (ie. that the atomic_read() prevents the
> compiler from doing DCE).
> 
> cheers

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: [PATCH] sched/membarrier: Fix redundant load of membarrier_state
  2024-10-25  0:29 ` Michael Ellerman
                     ` (2 preceding siblings ...)
  2024-10-25 12:40   ` Mathieu Desnoyers
@ 2024-10-25 18:30   ` Nysal Jan K.A.
  3 siblings, 0 replies; 15+ messages in thread
From: Nysal Jan K.A. @ 2024-10-25 18:30 UTC (permalink / raw)
  To: Michael Ellerman, Andrew Morton, Mathieu Desnoyers
  Cc: linuxppc-dev, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
	Justin Stitt, Vlastimil Babka, Kent Overstreet, Rick Edgecombe,
	Roman Gushchin, linux-kernel, llvm

On Fri, Oct 25, 2024 at 11:29:38AM +1100, Michael Ellerman wrote:
> [To += Mathieu]
> 
> "Nysal Jan K.A." <nysal@linux.ibm.com> writes:
> > From: "Nysal Jan K.A" <nysal@linux.ibm.com>
> >
> > On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> > is not selected, sync_core_before_usermode() is a no-op.
> > In membarrier_mm_sync_core_before_usermode() the compiler does not
> > eliminate redundant branches and the load of mm->membarrier_state
> > for this case as the atomic_read() cannot be optimized away.
> 
> I was wondering if this was caused by powerpc's arch_atomic_read() which
> uses asm volatile.
> 

Yes, that's my understanding as well

> But replacing arch_atomic_read() with READ_ONCE() makes no difference,
> presumably because the compiler still can't see that the READ_ONCE() is
> unnecessary (which is kind of by design).
> 

In READ_ONCE() we cast to a volatile pointer, so I think the compiler cannot
eliminate the load in that case.

> > Here's a snippet of the code generated for finish_task_switch() on powerpc:
> >
> > 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
> > .......
> > 1b78c8:   cmpdi   cr7,r26,0
> > 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
> > 1b78d0:   ld      r9,2312(r13)    # current
> > 1b78d4:   ld      r9,1888(r9)     # current->mm
> > 1b78d8:   cmpd    cr7,r26,r9
> > 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
> > 1b78e0:   hwsync
> > 1b78e4:   cmplwi  cr7,r27,128
> > .......
> > 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> > 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
> >
> > This was found while analyzing "perf c2c" reports on kernels prior
> > to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
> > where mm_count was false sharing with membarrier_state.
> 
> So it was causing a noticeable performance blip? But it isn't anymore?
> 

It was noticeable in that it showed up amongst the top entries in perf c2c reports.
There was similar false sharing with other fields that shared the cache line with
mm_count, so the gains were minimal with just this patch. Commit c1753fd02a00
addresses those cases too.

> > There is a minor improvement in the size of finish_task_switch().
> > The following are results from bloat-o-meter:
> >
> > GCC 7.5.0:
> > ----------
> > add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> > Function                                     old     new   delta
> > finish_task_switch                           884     852     -32
> >
> > GCC 12.2.1:
> > -----------
> > add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> > Function                                     old     new   delta
> > finish_task_switch.isra                      852     820     -32
> 
> GCC 12 is a couple of years old, I assume GCC 14 behaves similarly?
> 

I cross compiled for aarch64 with gcc 14.1.1 and see similar results:

add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
Function                                     old     new   delta
get_nohz_timer_target                        352     356      +4
e843419@0b02_0000d7e7_408                      8       -      -8
e843419@01bb_000021d2_868                      8       -      -8
finish_task_switch.isra                      592     548     -44
Total: Before=31013792, After=31013736, chg -0.00%

> > LLVM 17.0.6:
> > ------------
> > add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
> > Function                                     old     new   delta
> > rt_mutex_schedule                            120     104     -16
> > finish_task_switch                           792     772     -20
> >
> > Signed-off-by: Nysal Jan K.A <nysal@linux.ibm.com>
> > ---
> >  include/linux/sched/mm.h | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> > index 07bb8d4181d7..042e60ab853a 100644
> > --- a/include/linux/sched/mm.h
> > +++ b/include/linux/sched/mm.h
> > @@ -540,6 +540,8 @@ enum {
> >  
> >  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
> >  {
> > +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> > +		return;
> >  	if (current->mm != mm)
> >  		return;
> >  	if (likely(!(atomic_read(&mm->membarrier_state) &
> 
> The other option would be to have a completely separate stub, eg:
> 
>   #ifdef CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>   {
>           if (current->mm != mm)
>                   return;
>           if (likely(!(atomic_read(&mm->membarrier_state) &
>                        MEMBARRIER_STATE_PRIVATE_EXPEDITED_SYNC_CORE)))
>                   return;
>           sync_core_before_usermode();
>   }
>   #else
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm) { }
>   #endif
> 
> Not sure what folks prefer.
> 
> In either case I think it's probably worth a short comment explaining
> why it's worth the trouble (ie. that the atomic_read() prevents the
> compiler from doing DCE).
>

I'll send a v2 with a comment added in there. Thanks for the review.

--Nysal


* [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
  2024-10-07  5:39 [PATCH] sched/membarrier: Fix redundant load of membarrier_state Nysal Jan K.A.
  2024-10-25  0:29 ` Michael Ellerman
@ 2024-10-29  5:51 ` Nysal Jan K.A.
  2024-10-29 17:51   ` Mathieu Desnoyers
                     ` (4 more replies)
  1 sibling, 5 replies; 15+ messages in thread
From: Nysal Jan K.A. @ 2024-10-29  5:51 UTC (permalink / raw)
  To: Andrew Morton, Michael Ellerman, Mathieu Desnoyers
  Cc: Segher Boessenkool, Stephen Rothwell, Peter Zijlstra,
	linuxppc-dev, Nysal Jan K.A., Nathan Chancellor, Nick Desaulniers,
	Bill Wendling, Justin Stitt, Vlastimil Babka, Liam R. Howlett,
	Mark Brown, Michal Hocko, Kent Overstreet, Rick Edgecombe,
	linux-kernel, llvm

On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
is not selected, sync_core_before_usermode() is a no-op.
In membarrier_mm_sync_core_before_usermode() the compiler does not
eliminate redundant branches and load of mm->membarrier_state
for this case as the atomic_read() cannot be optimized away.

Here's a snippet of the code generated for finish_task_switch() on powerpc
prior to this change:

1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
.......
1b78c8:   cmpdi   cr7,r26,0
1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
1b78d0:   ld      r9,2312(r13)    # current
1b78d4:   ld      r9,1888(r9)     # current->mm
1b78d8:   cmpd    cr7,r26,r9
1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
1b78e0:   hwsync
1b78e4:   cmplwi  cr7,r27,128
.......
1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
1b7a74:   b       1b78e0 <finish_task_switch+0xcc>

This was found while analyzing "perf c2c" reports on kernels prior
to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
where mm_count was false sharing with membarrier_state.

There is a minor improvement in the size of finish_task_switch().
The following are results from bloat-o-meter for ppc64le:

GCC 7.5.0
---------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                                     old     new   delta
finish_task_switch                           884     852     -32

GCC 12.2.1
----------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                                     old     new   delta
finish_task_switch.isra                      852     820     -32

LLVM 17.0.6
-----------
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
Function                                     old     new   delta
rt_mutex_schedule                            120     104     -16
finish_task_switch                           792     772     -20

Results on aarch64:

GCC 14.1.1
----------
add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
Function                                     old     new   delta
get_nohz_timer_target                        352     356      +4
e843419@0b02_0000d7e7_408                      8       -      -8
e843419@01bb_000021d2_868                      8       -      -8
finish_task_switch.isra                      592     548     -44

Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
---
V1 -> V2:
- Add results for aarch64
- Add a comment describing the changes
---
 include/linux/sched/mm.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 928a626725e6..b13474825130 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -531,6 +531,13 @@ enum {
 
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 {
+	/*
+	 * The atomic_read() below prevents CSE. The following should
+	 * help the compiler generate more efficient code on architectures
+	 * where sync_core_before_usermode() is a no-op.
+	 */
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
+		return;
 	if (current->mm != mm)
 		return;
 	if (likely(!(atomic_read(&mm->membarrier_state) &
-- 
2.47.0


* Re: [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
  2024-10-29  5:51 ` [PATCH v2] " Nysal Jan K.A.
@ 2024-10-29 17:51   ` Mathieu Desnoyers
  2024-10-29 23:29   ` Michael Ellerman
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Mathieu Desnoyers @ 2024-10-29 17:51 UTC (permalink / raw)
  To: Nysal Jan K.A., Andrew Morton, Michael Ellerman
  Cc: Segher Boessenkool, Stephen Rothwell, Peter Zijlstra,
	linuxppc-dev, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
	Justin Stitt, Vlastimil Babka, Liam R. Howlett, Mark Brown,
	Michal Hocko, Kent Overstreet, Rick Edgecombe, linux-kernel, llvm

On 2024-10-29 01:51, Nysal Jan K.A. wrote:
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.
> 
> Here's a snippet of the code generated for finish_task_switch() on powerpc
> prior to this change:
> 
> 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
> .......
> 1b78c8:   cmpdi   cr7,r26,0
> 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
> 1b78d0:   ld      r9,2312(r13)    # current
> 1b78d4:   ld      r9,1888(r9)     # current->mm
> 1b78d8:   cmpd    cr7,r26,r9
> 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
> 1b78e0:   hwsync
> 1b78e4:   cmplwi  cr7,r27,128
> .......
> 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
> 
> This was found while analyzing "perf c2c" reports on kernels prior
> to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
> where mm_count was false sharing with membarrier_state.
> 
> There is a minor improvement in the size of finish_task_switch().
> The following are results from bloat-o-meter for ppc64le:
> 
> GCC 7.5.0
> ---------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch                           884     852     -32
> 
> GCC 12.2.1
> ----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch.isra                      852     820     -32
> 
> LLVM 17.0.6
> -----------
> add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
> Function                                     old     new   delta
> rt_mutex_schedule                            120     104     -16
> finish_task_switch                           792     772     -20
> 
> Results on aarch64:
> 
> GCC 14.1.1
> ----------
> add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
> Function                                     old     new   delta
> get_nohz_timer_target                        352     356      +4
> e843419@0b02_0000d7e7_408                      8       -      -8
> e843419@01bb_000021d2_868                      8       -      -8
> finish_task_switch.isra                      592     548     -44
> 
> Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>

Thanks!

Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>

> ---
> V1 -> V2:
> - Add results for aarch64
> - Add a comment describing the changes
> ---
>   include/linux/sched/mm.h | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 928a626725e6..b13474825130 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -531,6 +531,13 @@ enum {
>   
>   static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>   {
> +	/*
> +	 * The atomic_read() below prevents CSE. The following should
> +	 * help the compiler generate more efficient code on architectures
> +	 * where sync_core_before_usermode() is a no-op.
> +	 */
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> +		return;
>   	if (current->mm != mm)
>   		return;
>   	if (likely(!(atomic_read(&mm->membarrier_state) &

-- 
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com


* Re: [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
  2024-10-29  5:51 ` [PATCH v2] " Nysal Jan K.A.
  2024-10-29 17:51   ` Mathieu Desnoyers
@ 2024-10-29 23:29   ` Michael Ellerman
  2024-10-30 13:33   ` Segher Boessenkool
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 15+ messages in thread
From: Michael Ellerman @ 2024-10-29 23:29 UTC (permalink / raw)
  To: Nysal Jan K.A., Andrew Morton, Mathieu Desnoyers
  Cc: Segher Boessenkool, Stephen Rothwell, Peter Zijlstra,
	linuxppc-dev, Nysal Jan K.A., Nathan Chancellor, Nick Desaulniers,
	Bill Wendling, Justin Stitt, Vlastimil Babka, Liam R. Howlett,
	Mark Brown, Michal Hocko, Kent Overstreet, Rick Edgecombe,
	linux-kernel, llvm

"Nysal Jan K.A." <nysal@linux.ibm.com> writes:
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.
>
> Here's a snippet of the code generated for finish_task_switch() on powerpc
> prior to this change:
>
> 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
> .......
> 1b78c8:   cmpdi   cr7,r26,0
> 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
> 1b78d0:   ld      r9,2312(r13)    # current
> 1b78d4:   ld      r9,1888(r9)     # current->mm
> 1b78d8:   cmpd    cr7,r26,r9
> 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
> 1b78e0:   hwsync
> 1b78e4:   cmplwi  cr7,r27,128
> .......
> 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>

Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>

cheers

* Re: [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
  2024-10-29  5:51 ` [PATCH v2] " Nysal Jan K.A.
  2024-10-29 17:51   ` Mathieu Desnoyers
  2024-10-29 23:29   ` Michael Ellerman
@ 2024-10-30 13:33   ` Segher Boessenkool
  2024-11-18  9:04   ` Michal Hocko
  2025-03-03  6:04   ` [PATCH v2 RESEND] " Nysal Jan K.A.
  4 siblings, 0 replies; 15+ messages in thread
From: Segher Boessenkool @ 2024-10-30 13:33 UTC (permalink / raw)
  To: Nysal Jan K.A.
  Cc: Andrew Morton, Michael Ellerman, Mathieu Desnoyers,
	Stephen Rothwell, Peter Zijlstra, linuxppc-dev, Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt, Vlastimil Babka,
	Liam R. Howlett, Mark Brown, Michal Hocko, Kent Overstreet,
	Rick Edgecombe, linux-kernel, llvm

Hi!

On Tue, Oct 29, 2024 at 11:21:28AM +0530, Nysal Jan K.A. wrote:
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.
> 
> Here's a snippet of the code generated for finish_task_switch() on powerpc
> prior to this change:
> 
> 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
> .......
> 1b78c8:   cmpdi   cr7,r26,0
> 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
> 1b78d0:   ld      r9,2312(r13)    # current
> 1b78d4:   ld      r9,1888(r9)     # current->mm
> 1b78d8:   cmpd    cr7,r26,r9
> 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
> 1b78e0:   hwsync
> 1b78e4:   cmplwi  cr7,r27,128
> .......
> 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
> 
> This was found while analyzing "perf c2c" reports on kernels prior
> to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
> where mm_count was false sharing with membarrier_state.
> 
> There is a minor improvement in the size of finish_task_switch().
> The following are results from bloat-o-meter for ppc64le:
> 
> GCC 7.5.0
> ---------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch                           884     852     -32
> 
> GCC 12.2.1
> ----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch.isra                      852     820     -32
> 
> LLVM 17.0.6
> -----------
> add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
> Function                                     old     new   delta
> rt_mutex_schedule                            120     104     -16
> finish_task_switch                           792     772     -20
> 
> Results on aarch64:
> 
> GCC 14.1.1
> ----------
> add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
> Function                                     old     new   delta
> get_nohz_timer_target                        352     356      +4
> e843419@0b02_0000d7e7_408                      8       -      -8
> e843419@01bb_000021d2_868                      8       -      -8
> finish_task_switch.isra                      592     548     -44
> 
> Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
> ---
> V1 -> V2:
> - Add results for aarch64
> - Add a comment describing the changes
> ---
>  include/linux/sched/mm.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 928a626725e6..b13474825130 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -531,6 +531,13 @@ enum {
>  
>  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>  {
> +	/*
> +	 * The atomic_read() below prevents CSE. The following should
> +	 * help the compiler generate more efficient code on architectures
> +	 * where sync_core_before_usermode() is a no-op.
> +	 */
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> +		return;
>  	if (current->mm != mm)
>  		return;
>  	if (likely(!(atomic_read(&mm->membarrier_state) &

I'd say "CSE and similar transformations", but yeah, in this case CSE.
The point is that any access to a volatile object is a necessary side-
effect, so it has to be performed on the actual machine just as on the
abstract machine (on all the same paths, and as often).  It might be
nice to have an atomic_read (for PowerPC) that can generate better
machine code.  Not a trivial task though!
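
As one hedged illustration of that direction, a relaxed load through the
compiler's __atomic builtin (whether the kernel's expected semantics
would actually permit this is exactly the non-trivial part):

  /* A relaxed atomic load via the builtin is still a single non-tearing
   * load, but the compiler sees an atomic access rather than an opaque
   * asm block, which in principle gives it more room to optimize. */
  static inline int arch_atomic_read_relaxed(const atomic_t *v)
  {
          return __atomic_load_n(&v->counter, __ATOMIC_RELAXED);
  }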

Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>


Segher

* Re: [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
  2024-10-29  5:51 ` [PATCH v2] " Nysal Jan K.A.
                     ` (2 preceding siblings ...)
  2024-10-30 13:33   ` Segher Boessenkool
@ 2024-11-18  9:04   ` Michal Hocko
  2024-11-18  9:25     ` Peter Zijlstra
  2025-03-03  6:04   ` [PATCH v2 RESEND] " Nysal Jan K.A.
  4 siblings, 1 reply; 15+ messages in thread
From: Michal Hocko @ 2024-11-18  9:04 UTC (permalink / raw)
  To: Nysal Jan K.A.
  Cc: Andrew Morton, Michael Ellerman, Mathieu Desnoyers,
	Segher Boessenkool, Stephen Rothwell, Peter Zijlstra,
	linuxppc-dev, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
	Justin Stitt, Vlastimil Babka, Liam R. Howlett, Mark Brown,
	Kent Overstreet, Rick Edgecombe, linux-kernel, llvm

I do not see this patch staged in any tree (e.g. linux-next). Is this on
its way to be merged?

Thanks!

On Tue 29-10-24 11:21:28, Nysal Jan K.A. wrote:
> On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
> is not selected, sync_core_before_usermode() is a no-op.
> In membarrier_mm_sync_core_before_usermode() the compiler does not
> eliminate redundant branches and load of mm->membarrier_state
> for this case as the atomic_read() cannot be optimized away.
> 
> Here's a snippet of the code generated for finish_task_switch() on powerpc
> prior to this change:
> 
> 1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
> .......
> 1b78c8:   cmpdi   cr7,r26,0
> 1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
> 1b78d0:   ld      r9,2312(r13)    # current
> 1b78d4:   ld      r9,1888(r9)     # current->mm
> 1b78d8:   cmpd    cr7,r26,r9
> 1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
> 1b78e0:   hwsync
> 1b78e4:   cmplwi  cr7,r27,128
> .......
> 1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
> 1b7a74:   b       1b78e0 <finish_task_switch+0xcc>
> 
> This was found while analyzing "perf c2c" reports on kernels prior
> to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
> where mm_count was false sharing with membarrier_state.
> 
> There is a minor improvement in the size of finish_task_switch().
> The following are results from bloat-o-meter for ppc64le:
> 
> GCC 7.5.0
> ---------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch                           884     852     -32
> 
> GCC 12.2.1
> ----------
> add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
> Function                                     old     new   delta
> finish_task_switch.isra                      852     820     -32
> 
> LLVM 17.0.6
> -----------
> add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
> Function                                     old     new   delta
> rt_mutex_schedule                            120     104     -16
> finish_task_switch                           792     772     -20
> 
> Results on aarch64:
> 
> GCC 14.1.1
> ----------
> add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
> Function                                     old     new   delta
> get_nohz_timer_target                        352     356      +4
> e843419@0b02_0000d7e7_408                      8       -      -8
> e843419@01bb_000021d2_868                      8       -      -8
> finish_task_switch.isra                      592     548     -44
> 
> Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
> ---
> V1 -> V2:
> - Add results for aarch64
> - Add a comment describing the changes
> ---
>  include/linux/sched/mm.h | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
> index 928a626725e6..b13474825130 100644
> --- a/include/linux/sched/mm.h
> +++ b/include/linux/sched/mm.h
> @@ -531,6 +531,13 @@ enum {
>  
>  static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
>  {
> +	/*
> +	 * The atomic_read() below prevents CSE. The following should
> +	 * help the compiler generate more efficient code on architectures
> +	 * where sync_core_before_usermode() is a no-op.
> +	 */
> +	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
> +		return;
>  	if (current->mm != mm)
>  		return;
>  	if (likely(!(atomic_read(&mm->membarrier_state) &
> -- 
> 2.47.0
> 

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
  2024-11-18  9:04   ` Michal Hocko
@ 2024-11-18  9:25     ` Peter Zijlstra
  2024-11-18  9:51       ` Michal Hocko
  2025-01-09  8:46       ` Michal Hocko
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Zijlstra @ 2024-11-18  9:25 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Nysal Jan K.A., Andrew Morton, Michael Ellerman,
	Mathieu Desnoyers, Segher Boessenkool, Stephen Rothwell,
	linuxppc-dev, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
	Justin Stitt, Vlastimil Babka, Liam R. Howlett, Mark Brown,
	Kent Overstreet, Rick Edgecombe, linux-kernel, llvm

On Mon, Nov 18, 2024 at 10:04:18AM +0100, Michal Hocko wrote:
> I do not see this patch staged in any tree (e.g. linux-next). Is this on
> its way to be merged?

I only now found it -- it doesn't look super urgent. I'll get it into a
git tree after -rc1.

* Re: [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
  2024-11-18  9:25     ` Peter Zijlstra
@ 2024-11-18  9:51       ` Michal Hocko
  2025-01-09  8:46       ` Michal Hocko
  1 sibling, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2024-11-18  9:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nysal Jan K.A., Andrew Morton, Michael Ellerman,
	Mathieu Desnoyers, Segher Boessenkool, Stephen Rothwell,
	linuxppc-dev, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
	Justin Stitt, Vlastimil Babka, Liam R. Howlett, Mark Brown,
	Kent Overstreet, Rick Edgecombe, linux-kernel, llvm

On Mon 18-11-24 10:25:17, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 10:04:18AM +0100, Michal Hocko wrote:
> > I do not see this patch staged in any tree (e.g. linux-next). Is this on
> > its way to be merged?
> 
> I only now found it -- it doesn't look super urgent. I'll get it into a
> git tree after -rc1.

Thanks!

-- 
Michal Hocko
SUSE Labs

* Re: [PATCH v2] sched/membarrier: Fix redundant load of membarrier_state
  2024-11-18  9:25     ` Peter Zijlstra
  2024-11-18  9:51       ` Michal Hocko
@ 2025-01-09  8:46       ` Michal Hocko
  1 sibling, 0 replies; 15+ messages in thread
From: Michal Hocko @ 2025-01-09  8:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nysal Jan K.A., Andrew Morton, Michael Ellerman,
	Mathieu Desnoyers, Segher Boessenkool, Stephen Rothwell,
	linuxppc-dev, Nathan Chancellor, Nick Desaulniers, Bill Wendling,
	Justin Stitt, Vlastimil Babka, Liam R. Howlett, Mark Brown,
	Kent Overstreet, Rick Edgecombe, linux-kernel, llvm

On Mon 18-11-24 10:25:17, Peter Zijlstra wrote:
> On Mon, Nov 18, 2024 at 10:04:18AM +0100, Michal Hocko wrote:
> > I do not see this patch staged in any tree (e.g. linux-next). Is this on
> > its way to be merged?
> 
> I only now found it -- it doesn't look super urgent. I'll get it into a
> git tree after -rc1.

I cannot seem to find it in linux-next. Which tree is this
sitting in?

Thanks!
-- 
Michal Hocko
SUSE Labs

* [PATCH v2 RESEND] sched/membarrier: Fix redundant load of membarrier_state
  2024-10-29  5:51 ` [PATCH v2] " Nysal Jan K.A.
                     ` (3 preceding siblings ...)
  2024-11-18  9:04   ` Michal Hocko
@ 2025-03-03  6:04   ` Nysal Jan K.A.
  4 siblings, 0 replies; 15+ messages in thread
From: Nysal Jan K.A. @ 2025-03-03  6:04 UTC (permalink / raw)
  To: Peter Zijlstra, Andrew Morton, Ingo Molnar
  Cc: Michael Ellerman, Mathieu Desnoyers, Segher Boessenkool,
	Stephen Rothwell, linuxppc-dev, Nysal Jan K.A., Nathan Chancellor,
	Nick Desaulniers, Bill Wendling, Justin Stitt, Liam R. Howlett,
	Rick Edgecombe, Mark Brown, Kent Overstreet, Michal Hocko,
	linux-kernel, llvm

On architectures where ARCH_HAS_SYNC_CORE_BEFORE_USERMODE
is not selected, sync_core_before_usermode() is a no-op.
In membarrier_mm_sync_core_before_usermode() the compiler does not
eliminate redundant branches and load of mm->membarrier_state
for this case as the atomic_read() cannot be optimized away.

Here's a snippet of the code generated for finish_task_switch() on powerpc
prior to this change:

1b786c:   ld      r26,2624(r30)   # mm = rq->prev_mm;
.......
1b78c8:   cmpdi   cr7,r26,0
1b78cc:   beq     cr7,1b78e4 <finish_task_switch+0xd0>
1b78d0:   ld      r9,2312(r13)    # current
1b78d4:   ld      r9,1888(r9)     # current->mm
1b78d8:   cmpd    cr7,r26,r9
1b78dc:   beq     cr7,1b7a70 <finish_task_switch+0x25c>
1b78e0:   hwsync
1b78e4:   cmplwi  cr7,r27,128
.......
1b7a70:   lwz     r9,176(r26)     # atomic_read(&mm->membarrier_state)
1b7a74:   b       1b78e0 <finish_task_switch+0xcc>

This was found while analyzing "perf c2c" reports on kernels prior
to commit c1753fd02a00 ("mm: move mm_count into its own cache line")
where mm_count was false sharing with membarrier_state.

There is a minor improvement in the size of finish_task_switch().
The following are results from bloat-o-meter for ppc64le:

GCC 7.5.0
---------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                                     old     new   delta
finish_task_switch                           884     852     -32

GCC 12.2.1
----------
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-32 (-32)
Function                                     old     new   delta
finish_task_switch.isra                      852     820     -32

LLVM 17.0.6
-----------
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-36 (-36)
Function                                     old     new   delta
rt_mutex_schedule                            120     104     -16
finish_task_switch                           792     772     -20

Results on aarch64:

GCC 14.1.1
----------
add/remove: 0/2 grow/shrink: 1/1 up/down: 4/-60 (-56)
Function                                     old     new   delta
get_nohz_timer_target                        352     356      +4
e843419@0b02_0000d7e7_408                      8       -      -8
e843419@01bb_000021d2_868                      8       -      -8
finish_task_switch.isra                      592     548     -44

Signed-off-by: Nysal Jan K.A. <nysal@linux.ibm.com>
Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Reviewed-by: Michael Ellerman <mpe@ellerman.id.au>
Reviewed-by: Segher Boessenkool <segher@kernel.crashing.org>
---
V1 -> V2:
- Add results for aarch64
- Add a comment describing the changes
---
 include/linux/sched/mm.h | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/include/linux/sched/mm.h b/include/linux/sched/mm.h
index 928a626725e6..b13474825130 100644
--- a/include/linux/sched/mm.h
+++ b/include/linux/sched/mm.h
@@ -531,6 +531,13 @@ enum {
 
 static inline void membarrier_mm_sync_core_before_usermode(struct mm_struct *mm)
 {
+	/*
+	 * The atomic_read() below prevents CSE. The following should
+	 * help the compiler generate more efficient code on architectures
+	 * where sync_core_before_usermode() is a no-op.
+	 */
+	if (!IS_ENABLED(CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE))
+		return;
 	if (current->mm != mm)
 		return;
 	if (likely(!(atomic_read(&mm->membarrier_state) &
-- 
2.48.1


end of thread

Thread overview: 15+ messages
2024-10-07  5:39 [PATCH] sched/membarrier: Fix redundant load of membarrier_state Nysal Jan K.A.
2024-10-25  0:29 ` Michael Ellerman
2024-10-25  2:40   ` Stephen Rothwell
2024-10-25  3:22   ` Segher Boessenkool
2024-10-25 12:40   ` Mathieu Desnoyers
2024-10-25 18:30   ` Nysal Jan K.A.
2024-10-29  5:51 ` [PATCH v2] " Nysal Jan K.A.
2024-10-29 17:51   ` Mathieu Desnoyers
2024-10-29 23:29   ` Michael Ellerman
2024-10-30 13:33   ` Segher Boessenkool
2024-11-18  9:04   ` Michal Hocko
2024-11-18  9:25     ` Peter Zijlstra
2024-11-18  9:51       ` Michal Hocko
2025-01-09  8:46       ` Michal Hocko
2025-03-03  6:04   ` [PATCH v2 RESEND] " Nysal Jan K.A.
