linux-kernel.vger.kernel.org archive mirror
* [PATCH] powerpc/mm: Switch MMU context on hash MMU if SLB preload cache is aged
@ 2025-07-31 16:10 Donet Tom
  2025-07-31 18:10 ` Ritesh Harjani
  0 siblings, 1 reply; 3+ messages in thread
From: Donet Tom @ 2025-07-31 16:10 UTC (permalink / raw)
  To: Madhavan Srinivasan, Christophe Leroy, linuxppc-dev
  Cc: Ritesh Harjani, linux-kernel, Michael Ellerman, Nicholas Piggin,
	Vishal Chourasia, Donet Tom

On systems using the hash MMU, there is a software SLB preload cache that
mirrors the entries loaded into the hardware SLB buffer. This preload
cache is subject to periodic eviction, typically after every 256 context
switches, to remove old entries.
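
The 256-switch cadence comes from an 8-bit counter that wraps on
overflow. Below is a minimal illustrative sketch of that aging scheme,
using simplified stand-in names (the real counter is thread.load_slb and
the real eviction helper is preload_age(), both visible in the diff
below):

	struct preload_state {
		unsigned char load_slb;	/* bumped on every context switch */
	};

	/* Stand-in for the kernel's preload_age(): drop one aged entry. */
	static void evict_oldest_preload_entry(struct preload_state *s)
	{
		(void)s;	/* eviction details omitted in this sketch */
	}

	static void on_context_switch(struct preload_state *s)
	{
		s->load_slb++;
		if (!s->load_slb)	/* u8 wrapped: 256 switches elapsed */
			evict_oldest_preload_entry(s);
	}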

Currently, the kernel skips the MMU context switch in switch_mm_irqs_off()
if the prev and next mm_struct are the same, as an optimization. However,
this behavior can lead to problems on hash MMU systems.

Consider the following scenario: a process is running on CPU A and gets
context-switched to CPU B. During this time, one of its SLB preload cache
entries is evicted. Later, the process is rescheduled on CPU A, which was
running swapper in the meantime, using the same mm_struct. Because
prev == next, the kernel skips the MMU context switch. As a result, the
hardware SLB buffer still contains the entry, but the software preload
cache does not.

Because the entry is absent from the preload cache, the kernel attempts
to reload it into the SLB. However, since the entry is already present
in the hardware SLB, this leads to an SLB multi-hit error.
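
For illustration, here is a hedged sketch of the failing pattern; all
helper names below are hypothetical stand-ins, not the kernel's API:

	#include <stdbool.h>

	/* Hypothetical helpers, for illustration only (not the kernel API): */
	static bool sw_preload_cache_contains(unsigned long ea) { (void)ea; return false; }
	static void sw_preload_cache_add(unsigned long ea) { (void)ea; }
	static void hw_slb_install(unsigned long ea) { (void)ea; }

	static void preload_entry(unsigned long ea)
	{
		if (!sw_preload_cache_contains(ea)) {	/* entry was aged out */
			sw_preload_cache_add(ea);
			/*
			 * The HW SLB on this CPU was never invalidated and
			 * still maps the same ESID, so installing the entry
			 * again leaves two matching entries: an SLB multi-hit.
			 */
			hw_slb_install(ea);
		}
	}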

To fix this issue, always switch the MMU context on the hash MMU if the
SLB preload cache has aged. With this change, the SLB multi-hit error no
longer occurs.

Fixes: 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache")
Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
---
 arch/powerpc/mm/book3s64/slb.c | 2 +-
 arch/powerpc/mm/mmu_context.c  | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index 6b783552403c..08daac3f978c 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -509,7 +509,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 	 * SLB preload cache.
 	 */
 	tsk->thread.load_slb++;
-	if (!tsk->thread.load_slb) {
+	if (tsk->thread.load_slb == U8_MAX) {
 		unsigned long pc = KSTK_EIP(tsk);
 
 		preload_age(ti);
diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
index 3e3af29b4523..d7b9ac8c9971 100644
--- a/arch/powerpc/mm/mmu_context.c
+++ b/arch/powerpc/mm/mmu_context.c
@@ -84,7 +84,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	switch_mm_pgdir(tsk, next);
 
 	/* Nothing else to do if we aren't actually switching */
-	if (prev == next)
+	if ((prev == next) && (tsk->thread.load_slb != U8_MAX))
 		return;
 
 	/*
-- 
2.50.1



* Re: [PATCH] powerpc/mm: Switch MMU context on hash MMU if SLB preload cache is aged
  2025-07-31 16:10 [PATCH] powerpc/mm: Switch MMU context on hash MMU if SLB preload cache is aged Donet Tom
@ 2025-07-31 18:10 ` Ritesh Harjani
  2025-08-01  2:18   ` Donet Tom
  0 siblings, 1 reply; 3+ messages in thread
From: Ritesh Harjani @ 2025-07-31 18:10 UTC (permalink / raw)
  To: Donet Tom
  Cc: linux-kernel, Michael Ellerman, Nicholas Piggin, Vishal Chourasia,
	Donet Tom, Madhavan Srinivasan, Christophe Leroy, linuxppc-dev

Donet Tom <donettom@linux.ibm.com> writes:

> On systems using the hash MMU, there is a software SLB preload cache that
> mirrors the entries loaded into the hardware SLB buffer. This preload
> cache is subject to periodic eviction, typically after every 256 context
> switches, to remove old entries.
>
> Currently, the kernel skips the MMU context switch in switch_mm_irqs_off()
> if the prev and next mm_struct are the same, as an optimization. However,
> this behavior can lead to problems on hash MMU systems.
>

Let's also add a detailed flow of events, as this was not really an easy
problem to catch.

CPU 0                                   CPU 1

Process P
exec                                    swapper/1
 load_elf_binary
  begin_new_exec
    activate_mm
     switch_mm_irqs_off 
      switch_mmu_context
       switch_slb
       /*
        * This invalidates all the
        * entries in the HW SLB and
        * sets up new HW SLB entries
        * as per the preload cache.
        */
sched_switch event
sched_migrate_task migrates process P to cpu-1

Process swapper/0                       context switch (to process P)
(uses mm_struct of Process P)           switch_mm_irqs_off()
                                         switch_slb
                                           load_slb++
                                            /*
                                            * load_slb becomes 0 here
                                            * and we evict an entry from
                                            * the preload cache with
                                            * preload_age(). We still
                                            * keep HW SLB and preload
                                            * cache in sync, because
                                            * all HW SLB entries get
                                            * evicted anyway in
                                            * switch_slb during SLBIA.
                                            * We then only add those
                                            * entries back in HW SLB,
                                            * which are currently
                                            * present in preload_cache
                                            * (after eviction).
                                            */
                                        load_elf_binary continues...
                                         setup_new_exec()
                                          slb_setup_new_exec()

                                        sched_switch event
                                        sched_migrate_task migrates 
                                        process P to cpu-0

context_switch from swapper/0 to Process P
 switch_mm_irqs_off()
  /*
   * Since both prev and next mm_struct are the same, we don't call
   * switch_mmu_context(). This causes the HW SLB and SW preload cache
   * to go out of sync in preload_new_slb_context(), because an SLB
   * entry was evicted from both the HW SLB and the preload cache on
   * cpu-1. Later, in preload_new_slb_context(), when we try to add the
   * same preload entry again, we add it to the SW preload cache and
   * then to the HW SLB. Since on cpu-0 this entry was never
   * invalidated, adding it to the HW SLB causes an SLB multi-hit
   * error.
   */
load_elf_binary continues...
 START_THREAD
  start_thread
   preload_new_slb_context
   /*
    * This tries to add to the preload cache an EA which was earlier
    * evicted from both the cpu-1 HW SLB and the preload cache. This
    * caused the HW SLB of cpu-0 to go out of sync with the SW preload
    * cache. The reason is that when we context switched back on cpu-0,
    * we should ideally have called switch_mmu_context(), which would
    * bring the HW SLB entries on cpu-0 in sync with the SW preload
    * cache entries by setting up the mmu context properly. But we
    * didn't do that, since the prev mm_struct running on cpu-0 was the
    * same as the next mm_struct (which is true for swapper / kernel
    * threads). So now, when we try to add this new entry into the HW
    * SLB of cpu-0, we hit an SLB multi-hit error.
    */

WARNING: CPU: 0 PID: 1810970 at arch/powerpc/mm/book3s64/slb.c:62 assert_slb_presence+0x2c/0x50
Modules linked in:
CPU: 0 UID: 0 PID: 1810970 Comm: dd Not tainted 6.16.0-rc3-dirty #12 VOLUNTARY
Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD hv:linux,kvm pSeries
NIP:  c00000000015426c LR: c0000000001543b4 CTR: 0000000000000000
REGS: c0000000497c77e0 TRAP: 0700   Not tainted  (6.16.0-rc3-dirty)
MSR:  8000000002823033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 28888482  XER: 00000000
CFAR: c0000000001543b0 IRQMASK: 3
<...>
NIP [c00000000015426c] assert_slb_presence+0x2c/0x50
LR [c0000000001543b4] slb_insert_entry+0x124/0x390
Call Trace:
  0x7fffceb5ffff (unreliable)
  preload_new_slb_context+0x100/0x1a0
  start_thread+0x26c/0x420
  load_elf_binary+0x1b04/0x1c40
  bprm_execve+0x358/0x680
  do_execveat_common+0x1f8/0x240
  sys_execve+0x58/0x70
  system_call_exception+0x114/0x300
  system_call_common+0x160/0x2c4


> Consider the following scenario: a process is running on CPU A and gets
> context-switched to CPU B. During this time, one of its SLB preload cache
> entries is evicted. Later, the process is rescheduled on CPU A, which was
> running swapper in the meantime, using the same mm_struct. Because
> prev == next, the kernel skips the MMU context switch. As a result, the
> hardware SLB buffer still contains the entry, but the software preload
> cache does not.
> 
> Because the entry is absent from the preload cache, the kernel attempts
> to reload it into the SLB. However, since the entry is already present
> in the hardware SLB, this leads to an SLB multi-hit error.
>

Can we use the detailed commit message from above instead of the above
two paragraphs? It is easier to visualize and document if we have it
that way.


> To fix this issue, always switch the MMU context on the hash MMU if the
> SLB preload cache has aged. With this change, the SLB multi-hit error no
> longer occurs.
>
> Fixes: 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache")

CC: stable@vger.kernel.org

Otherwise LGTM.

> Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
> ---
>  arch/powerpc/mm/book3s64/slb.c | 2 +-
>  arch/powerpc/mm/mmu_context.c  | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
> index 6b783552403c..08daac3f978c 100644
> --- a/arch/powerpc/mm/book3s64/slb.c
> +++ b/arch/powerpc/mm/book3s64/slb.c
> @@ -509,7 +509,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>  	 * SLB preload cache.
>  	 */
>  	tsk->thread.load_slb++;
> -	if (!tsk->thread.load_slb) {
> +	if (tsk->thread.load_slb == U8_MAX) {
>  		unsigned long pc = KSTK_EIP(tsk);
>  
>  		preload_age(ti);
> diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
> index 3e3af29b4523..d7b9ac8c9971 100644
> --- a/arch/powerpc/mm/mmu_context.c
> +++ b/arch/powerpc/mm/mmu_context.c
> @@ -84,7 +84,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>  	switch_mm_pgdir(tsk, next);
>  
>  	/* Nothing else to do if we aren't actually switching */
> -	if (prev == next)
> +	if ((prev == next) && (tsk->thread.load_slb != U8_MAX))
>  		return;
>  
>  	/*
> -- 
> 2.50.1


* Re: [PATCH] powerpc/mm: Switch MMU context on hash MMU if SLB preload cache is aged
  2025-07-31 18:10 ` Ritesh Harjani
@ 2025-08-01  2:18   ` Donet Tom
  0 siblings, 0 replies; 3+ messages in thread
From: Donet Tom @ 2025-08-01  2:18 UTC (permalink / raw)
  To: Ritesh Harjani (IBM)
  Cc: linux-kernel, Michael Ellerman, Nicholas Piggin, Vishal Chourasia,
	Madhavan Srinivasan, Christophe Leroy, linuxppc-dev

Hi Ritesh

On 7/31/25 11:40 PM, Ritesh Harjani (IBM) wrote:
> Donet Tom <donettom@linux.ibm.com> writes:
>
>> On systems using the hash MMU, there is a software SLB preload cache that
>> mirrors the entries loaded into the hardware SLB buffer. This preload
>> cache is subject to periodic eviction, typically after every 256 context
>> switches, to remove old entries.
>>
>> Currently, the kernel skips the MMU context switch in switch_mm_irqs_off()
>> if the prev and next mm_struct are the same, as an optimization. However,
>> this behavior can lead to problems on hash MMU systems.
>>
> Let's also add a detailed flow of events, as this was not really an easy
> problem to catch.
>
> CPU 0                                   CPU 1
>
> Process P
> exec                                    swapper/1
>   load_elf_binary
>    begin_new_exec
>      activate_mm
>       switch_mm_irqs_off
>        switch_mmu_context
>         switch_slb
>         /*
>          * This invalidates all the
>          * entries in the HW SLB and
>          * sets up new HW SLB entries
>          * as per the preload cache.
>          */
> sched_switch event
> sched_migrate_task migrates process P to cpu-1
>
> Process swapper/0                       context switch (to process P)
> (uses mm_struct of Process P)           switch_mm_irqs_off()
>                                           switch_slb
>                                             load_slb++
>                                              /*
>                                              * load_slb becomes 0 here
>                                              * and we evict an entry from
>                                              * the preload cache with
>                                              * preload_age(). We still
>                                              * keep HW SLB and preload
>                                              * cache in sync, that is
>                                              * cache in sync, because
>                                              * all HW SLB entries get
>                                              * evicted anyway in
>                                              * switch_slb during SLBIA.
>                                              * entries back in HW SLB,
>                                              * which are currently
>                                              * present in preload_cache
>                                              * (after eviction).
>                                              */
>                                          load_elf_binary continues...
>                                           setup_new_exec()
>                                            slb_setup_new_exec()
>
>                                          sched_switch event
>                                          sched_migrate_task migrates
>                                          process P to cpu-0
>
> context_switch from swapper/0 to Process P
>   switch_mm_irqs_off()
>    /*
>     * Since both prev and next mm_struct are the same, we don't call
>     * switch_mmu_context(). This causes the HW SLB and SW preload cache
>     * to go out of sync in preload_new_slb_context(), because an SLB
>     * entry was evicted from both the HW SLB and the preload cache on
>     * cpu-1. Later, in preload_new_slb_context(), when we try to add the
>     * same preload entry again, we add it to the SW preload cache and
>     * then to the HW SLB. Since on cpu-0 this entry was never
>     * invalidated, adding it to the HW SLB causes an SLB multi-hit
>     * error.
>     */
> load_elf_binary continues...
>   START_THREAD
>    start_thread
>     preload_new_slb_context
>     /*
>      * This tries to add to the preload cache an EA which was earlier
>      * evicted from both the cpu-1 HW SLB and the preload cache. This
>      * caused the HW SLB of cpu-0 to go out of sync with the SW preload
>      * cache. The reason is that when we context switched back on cpu-0,
>      * we should ideally have called switch_mmu_context(), which would
>      * bring the HW SLB entries on cpu-0 in sync with the SW preload
>      * cache entries by setting up the mmu context properly. But we
>      * didn't do that, since the prev mm_struct running on cpu-0 was the
>      * same as the next mm_struct (which is true for swapper / kernel
>      * threads). So now, when we try to add this new entry into the HW
>      * SLB of cpu-0, we hit an SLB multi-hit error.
>      */
>
> WARNING: CPU: 0 PID: 1810970 at arch/powerpc/mm/book3s64/slb.c:62 assert_slb_presence+0x2c/0x50
> Modules linked in:
> CPU: 0 UID: 0 PID: 1810970 Comm: dd Not tainted 6.16.0-rc3-dirty #12 VOLUNTARY
> Hardware name: IBM pSeries (emulated by qemu) POWER8 (architected) 0x4d0200 0xf000004 of:SLOF,HEAD hv:linux,kvm pSeries
> NIP:  c00000000015426c LR: c0000000001543b4 CTR: 0000000000000000
> REGS: c0000000497c77e0 TRAP: 0700   Not tainted  (6.16.0-rc3-dirty)
> MSR:  8000000002823033 <SF,VEC,VSX,FP,ME,IR,DR,RI,LE>  CR: 28888482  XER: 00000000
> CFAR: c0000000001543b0 IRQMASK: 3
> <...>
> NIP [c00000000015426c] assert_slb_presence+0x2c/0x50
> LR [c0000000001543b4] slb_insert_entry+0x124/0x390
> Call Trace:
>    0x7fffceb5ffff (unreliable)
>    preload_new_slb_context+0x100/0x1a0
>    start_thread+0x26c/0x420
>    load_elf_binary+0x1b04/0x1c40
>    bprm_execve+0x358/0x680
>    do_execveat_common+0x1f8/0x240
>    sys_execve+0x58/0x70
>    system_call_exception+0x114/0x300
>    system_call_common+0x160/0x2c4
>
>
>> Consider the following scenario: a process is running on CPU A and gets
>> context-switched to CPU B. During this time, one of its SLB preload cache
>> entries is evicted. Later, the process is rescheduled on CPU A, which was
>> running swapper in the meantime, using the same mm_struct. Because
>> prev == next, the kernel skips the MMU context switch. As a result, the
>> hardware SLB buffer still contains the entry, but the software preload
>> cache does not.
>>
>> Because the entry is absent from the preload cache, the kernel attempts
>> to reload it into the SLB. However, since the entry is already present
>> in the hardware SLB, this leads to an SLB multi-hit error.
>>
> Can we use the detailed commit message from above instead of the above
> two paragraphs? It is easier to visualize and document if we have it
> that way.


Yes, that makes sense — I’ll add this and send V2


>
>
>> To fix this issue, always switch the MMU context on the hash MMU if the
>> SLB preload cache has aged. With this change, the SLB multi-hit error no
>> longer occurs.
>>
>> Fixes: 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache")
> CC: stable@vger.kernel.org
>
> Otherwise LGTM.


Thank you.


>
>> Suggested-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>
>> Signed-off-by: Donet Tom <donettom@linux.ibm.com>
>> ---
>>   arch/powerpc/mm/book3s64/slb.c | 2 +-
>>   arch/powerpc/mm/mmu_context.c  | 2 +-
>>   2 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
>> index 6b783552403c..08daac3f978c 100644
>> --- a/arch/powerpc/mm/book3s64/slb.c
>> +++ b/arch/powerpc/mm/book3s64/slb.c
>> @@ -509,7 +509,7 @@ void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
>>   	 * SLB preload cache.
>>   	 */
>>   	tsk->thread.load_slb++;
>> -	if (!tsk->thread.load_slb) {
>> +	if (tsk->thread.load_slb == U8_MAX) {
>>   		unsigned long pc = KSTK_EIP(tsk);
>>   
>>   		preload_age(ti);
>> diff --git a/arch/powerpc/mm/mmu_context.c b/arch/powerpc/mm/mmu_context.c
>> index 3e3af29b4523..d7b9ac8c9971 100644
>> --- a/arch/powerpc/mm/mmu_context.c
>> +++ b/arch/powerpc/mm/mmu_context.c
>> @@ -84,7 +84,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
>>   	switch_mm_pgdir(tsk, next);
>>   
>>   	/* Nothing else to do if we aren't actually switching */
>> -	if (prev == next)
>> +	if ((prev == next) && (tsk->thread.load_slb != U8_MAX))
>>   		return;
>>   
>>   	/*
>> -- 
>> 2.50.1

