[PATCH] riscv: mm: Fixup spurious fault of kernel vaddr

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] riscv: mm: Fixup spurious fault of kernel vaddr
@ 2023-07-21 14:51 guoren
  2023-07-21 15:18 ` Alexandre Ghiti
  0 siblings, 1 reply; 8+ messages in thread
From: guoren @ 2023-07-21 14:51 UTC (permalink / raw)
  To: guoren, palmer, paul.walmsley, falcon, bjorn, conor.dooley, alex
  Cc: linux-arch, linux-kernel, linux-riscv, Guo Ren

From: Guo Ren <guoren@linux.alibaba.com>

RISC-V specification permits the caching of PTEs whose V (Valid)
bit is clear. Operating systems must be written to cope with this
possibility, but implementers are reminded that eagerly caching
invalid PTEs will reduce performance by causing additional page
faults.

So we must keep vmalloc_fault for the spurious page faults of kernel
virtual address from an OoO machine.

Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
Signed-off-by: Guo Ren <guoren@kernel.org>
---
 arch/riscv/mm/fault.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 85165fe438d8..f662c9eae7d4 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -258,8 +258,7 @@ void handle_page_fault(struct pt_regs *regs)
 	 * only copy the information from the master page table,
 	 * nothing more.
 	 */
-	if ((!IS_ENABLED(CONFIG_MMU) || !IS_ENABLED(CONFIG_64BIT)) &&
-	    unlikely(addr >= VMALLOC_START && addr < VMALLOC_END)) {
+	if (unlikely(addr >= TASK_SIZE)) {
 		vmalloc_fault(regs, code, addr);
 		return;
 	}
-- 
2.36.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] riscv: mm: Fixup spurious fault of kernel vaddr
  2023-07-21 14:51 [PATCH] riscv: mm: Fixup spurious fault of kernel vaddr guoren
@ 2023-07-21 15:18 ` Alexandre Ghiti
  2023-07-21 16:08   ` Guo Ren
  0 siblings, 1 reply; 8+ messages in thread
From: Alexandre Ghiti @ 2023-07-21 15:18 UTC (permalink / raw)
  To: guoren, palmer, paul.walmsley, falcon, bjorn, conor.dooley
  Cc: linux-arch, linux-kernel, linux-riscv, Guo Ren


On 21/07/2023 16:51, guoren@kernel.org wrote:
> From: Guo Ren <guoren@linux.alibaba.com>
>
> RISC-V specification permits the caching of PTEs whose V (Valid)
> bit is clear. Operating systems must be written to cope with this
> possibility, but implementers are reminded that eagerly caching
> invalid PTEs will reduce performance by causing additional page
> faults.
>
> So we must keep vmalloc_fault for the spurious page faults of kernel
> virtual address from an OoO machine.
>
> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> Signed-off-by: Guo Ren <guoren@kernel.org>
> ---
>   arch/riscv/mm/fault.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
> index 85165fe438d8..f662c9eae7d4 100644
> --- a/arch/riscv/mm/fault.c
> +++ b/arch/riscv/mm/fault.c
> @@ -258,8 +258,7 @@ void handle_page_fault(struct pt_regs *regs)
>   	 * only copy the information from the master page table,
>   	 * nothing more.
>   	 */
> -	if ((!IS_ENABLED(CONFIG_MMU) || !IS_ENABLED(CONFIG_64BIT)) &&
> -	    unlikely(addr >= VMALLOC_START && addr < VMALLOC_END)) {
> +	if (unlikely(addr >= TASK_SIZE)) {
>   		vmalloc_fault(regs, code, addr);
>   		return;
>   	}


Can you share what you are trying to fix here?

I have a fix (that's currently running our CI) for commit 7d3332be011e 
("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") that 
implements flush_cache_vmap() since we lost the vmalloc_fault.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] riscv: mm: Fixup spurious fault of kernel vaddr
  2023-07-21 15:18 ` Alexandre Ghiti
@ 2023-07-21 16:08   ` Guo Ren
  2023-07-21 20:00     ` Alexandre Ghiti
  0 siblings, 1 reply; 8+ messages in thread
From: Guo Ren @ 2023-07-21 16:08 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: palmer, paul.walmsley, falcon, bjorn, conor.dooley, linux-arch,
	linux-kernel, linux-riscv, Guo Ren

On Fri, Jul 21, 2023 at 11:19 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
>
> On 21/07/2023 16:51, guoren@kernel.org wrote:
> > From: Guo Ren <guoren@linux.alibaba.com>
> >
> > RISC-V specification permits the caching of PTEs whose V (Valid)
> > bit is clear. Operating systems must be written to cope with this
> > possibility, but implementers are reminded that eagerly caching
> > invalid PTEs will reduce performance by causing additional page
> > faults.
> >
> > So we must keep vmalloc_fault for the spurious page faults of kernel
> > virtual address from an OoO machine.
> >
> > Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> > Signed-off-by: Guo Ren <guoren@kernel.org>
> > ---
> >   arch/riscv/mm/fault.c | 3 +--
> >   1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
> > index 85165fe438d8..f662c9eae7d4 100644
> > --- a/arch/riscv/mm/fault.c
> > +++ b/arch/riscv/mm/fault.c
> > @@ -258,8 +258,7 @@ void handle_page_fault(struct pt_regs *regs)
> >        * only copy the information from the master page table,
> >        * nothing more.
> >        */
> > -     if ((!IS_ENABLED(CONFIG_MMU) || !IS_ENABLED(CONFIG_64BIT)) &&
> > -         unlikely(addr >= VMALLOC_START && addr < VMALLOC_END)) {
> > +     if (unlikely(addr >= TASK_SIZE)) {
> >               vmalloc_fault(regs, code, addr);
> >               return;
> >       }
>
>
> Can you share what you are trying to fix here?
We met a spurious page fault panic on an OoO machine.

1. The processor speculative execution brings the V=0 entries into the
TLB in the kernel virtual address.
2. Linux kernel installs the kernel virtual address with the page, and V=1
3. When kernel code access the kernel virtual address, it would raise
a page fault as the V=0 entry in the tlb.
4. No vmalloc_fault, then panic.

>
> I have a fix (that's currently running our CI) for commit 7d3332be011e
> ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") that
> implements flush_cache_vmap() since we lost the vmalloc_fault.
Could you share that patch?

>


-- 
Best Regards
 Guo Ren

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] riscv: mm: Fixup spurious fault of kernel vaddr
  2023-07-21 16:08   ` Guo Ren
@ 2023-07-21 20:00     ` Alexandre Ghiti
  2023-07-21 23:59       ` Guo Ren
  0 siblings, 1 reply; 8+ messages in thread
From: Alexandre Ghiti @ 2023-07-21 20:00 UTC (permalink / raw)
  To: Guo Ren
  Cc: palmer, paul.walmsley, falcon, bjorn, conor.dooley, linux-arch,
	linux-kernel, linux-riscv, Guo Ren


On 21/07/2023 18:08, Guo Ren wrote:
> On Fri, Jul 21, 2023 at 11:19 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>
>> On 21/07/2023 16:51, guoren@kernel.org wrote:
>>> From: Guo Ren <guoren@linux.alibaba.com>
>>>
>>> RISC-V specification permits the caching of PTEs whose V (Valid)
>>> bit is clear. Operating systems must be written to cope with this
>>> possibility, but implementers are reminded that eagerly caching
>>> invalid PTEs will reduce performance by causing additional page
>>> faults.
>>>
>>> So we must keep vmalloc_fault for the spurious page faults of kernel
>>> virtual address from an OoO machine.
>>>
>>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
>>> Signed-off-by: Guo Ren <guoren@kernel.org>
>>> ---
>>>    arch/riscv/mm/fault.c | 3 +--
>>>    1 file changed, 1 insertion(+), 2 deletions(-)
>>>
>>> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
>>> index 85165fe438d8..f662c9eae7d4 100644
>>> --- a/arch/riscv/mm/fault.c
>>> +++ b/arch/riscv/mm/fault.c
>>> @@ -258,8 +258,7 @@ void handle_page_fault(struct pt_regs *regs)
>>>         * only copy the information from the master page table,
>>>         * nothing more.
>>>         */
>>> -     if ((!IS_ENABLED(CONFIG_MMU) || !IS_ENABLED(CONFIG_64BIT)) &&
>>> -         unlikely(addr >= VMALLOC_START && addr < VMALLOC_END)) {
>>> +     if (unlikely(addr >= TASK_SIZE)) {
>>>                vmalloc_fault(regs, code, addr);
>>>                return;
>>>        }
>>
>> Can you share what you are trying to fix here?
> We met a spurious page fault panic on an OoO machine.
>
> 1. The processor speculative execution brings the V=0 entries into the
> TLB in the kernel virtual address.
> 2. Linux kernel installs the kernel virtual address with the page, and V=1
> 3. When kernel code access the kernel virtual address, it would raise
> a page fault as the V=0 entry in the tlb.
> 4. No vmalloc_fault, then panic.
>
>> I have a fix (that's currently running our CI) for commit 7d3332be011e
>> ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") that
>> implements flush_cache_vmap() since we lost the vmalloc_fault.
> Could you share that patch?


Here we go:


Author: Alexandre Ghiti <alexghiti@rivosinc.com>
Date:   Fri Jul 21 08:43:44 2023 +0000

     riscv: Implement flush_cache_vmap()

     The RISC-V kernel needs a sfence.vma after a page table 
modification: we
     used to rely on the vmalloc fault handling to emit an sfence.vma, but
     commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
     vmalloc/modules area") got rid of this path for 64-bit kernels, so 
now we
     need to explicitly emit a sfence.vma in flush_cache_vmap().

     Note that we don't need to implement flush_cache_vunmap() as the 
generic
     code should emit a flush tlb after unmapping a vmalloc region.

     Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for 
vmalloc/modules area")
     Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>

diff --git a/arch/riscv/include/asm/cacheflush.h 
b/arch/riscv/include/asm/cacheflush.h
index 8091b8bf4883..b93ffddf8a61 100644
--- a/arch/riscv/include/asm/cacheflush.h
+++ b/arch/riscv/include/asm/cacheflush.h
@@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
  #define flush_icache_user_page(vma, pg, addr, len) \
         flush_icache_mm(vma->vm_mm, 0)

+#ifdef CONFIG_64BIT
+#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
+#endif
+
  #ifndef CONFIG_SMP

  #define flush_icache_all() local_flush_icache_all()


Let me know if that works for you!


>
>

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH] riscv: mm: Fixup spurious fault of kernel vaddr
  2023-07-21 20:00     ` Alexandre Ghiti
@ 2023-07-21 23:59       ` Guo Ren
  2023-07-24  9:05         ` Alexandre Ghiti
  0 siblings, 1 reply; 8+ messages in thread
From: Guo Ren @ 2023-07-21 23:59 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: palmer, paul.walmsley, falcon, bjorn, conor.dooley, linux-arch,
	linux-kernel, linux-riscv, Guo Ren

On Fri, Jul 21, 2023 at 4:01 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
>
> On 21/07/2023 18:08, Guo Ren wrote:
> > On Fri, Jul 21, 2023 at 11:19 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
> >>
> >> On 21/07/2023 16:51, guoren@kernel.org wrote:
> >>> From: Guo Ren <guoren@linux.alibaba.com>
> >>>
> >>> RISC-V specification permits the caching of PTEs whose V (Valid)
> >>> bit is clear. Operating systems must be written to cope with this
> >>> possibility, but implementers are reminded that eagerly caching
> >>> invalid PTEs will reduce performance by causing additional page
> >>> faults.
> >>>
> >>> So we must keep vmalloc_fault for the spurious page faults of kernel
> >>> virtual address from an OoO machine.
> >>>
> >>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> >>> Signed-off-by: Guo Ren <guoren@kernel.org>
> >>> ---
> >>>    arch/riscv/mm/fault.c | 3 +--
> >>>    1 file changed, 1 insertion(+), 2 deletions(-)
> >>>
> >>> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
> >>> index 85165fe438d8..f662c9eae7d4 100644
> >>> --- a/arch/riscv/mm/fault.c
> >>> +++ b/arch/riscv/mm/fault.c
> >>> @@ -258,8 +258,7 @@ void handle_page_fault(struct pt_regs *regs)
> >>>         * only copy the information from the master page table,
> >>>         * nothing more.
> >>>         */
> >>> -     if ((!IS_ENABLED(CONFIG_MMU) || !IS_ENABLED(CONFIG_64BIT)) &&
> >>> -         unlikely(addr >= VMALLOC_START && addr < VMALLOC_END)) {
> >>> +     if (unlikely(addr >= TASK_SIZE)) {
> >>>                vmalloc_fault(regs, code, addr);
> >>>                return;
> >>>        }
> >>
> >> Can you share what you are trying to fix here?
> > We met a spurious page fault panic on an OoO machine.
> >
> > 1. The processor speculative execution brings the V=0 entries into the
> > TLB in the kernel virtual address.
> > 2. Linux kernel installs the kernel virtual address with the page, and V=1
> > 3. When kernel code access the kernel virtual address, it would raise
> > a page fault as the V=0 entry in the tlb.
> > 4. No vmalloc_fault, then panic.
> >
> >> I have a fix (that's currently running our CI) for commit 7d3332be011e
> >> ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") that
> >> implements flush_cache_vmap() since we lost the vmalloc_fault.
> > Could you share that patch?
>
>
> Here we go:
>
>
> Author: Alexandre Ghiti <alexghiti@rivosinc.com>
> Date:   Fri Jul 21 08:43:44 2023 +0000
>
>      riscv: Implement flush_cache_vmap()
>
>      The RISC-V kernel needs a sfence.vma after a page table
> modification: we
>      used to rely on the vmalloc fault handling to emit an sfence.vma, but
>      commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
>      vmalloc/modules area") got rid of this path for 64-bit kernels, so
> now we
>      need to explicitly emit a sfence.vma in flush_cache_vmap().
>
>      Note that we don't need to implement flush_cache_vunmap() as the
> generic
>      code should emit a flush tlb after unmapping a vmalloc region.
>
>      Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> vmalloc/modules area")
>      Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
>
> diff --git a/arch/riscv/include/asm/cacheflush.h
> b/arch/riscv/include/asm/cacheflush.h
> index 8091b8bf4883..b93ffddf8a61 100644
> --- a/arch/riscv/include/asm/cacheflush.h
> +++ b/arch/riscv/include/asm/cacheflush.h
> @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
>   #define flush_icache_user_page(vma, pg, addr, len) \
>          flush_icache_mm(vma->vm_mm, 0)
>
> +#ifdef CONFIG_64BIT
> +#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
> +#endif
I don't want that, and flush_tlb_kernel_range is flush_tlb_all. In
addition, it would call IPI, which is a performance killer.

What's the problem of spurious fault replay? It only costs a
local_tlb_flush with vaddr.

> +
>   #ifndef CONFIG_SMP
>
>   #define flush_icache_all() local_flush_icache_all()
>
>
> Let me know if that works for you!
>
>
> >
> >

--
Best Regards
 Guo Ren

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] riscv: mm: Fixup spurious fault of kernel vaddr
  2023-07-21 23:59       ` Guo Ren
@ 2023-07-24  9:05         ` Alexandre Ghiti
  2023-07-29  6:40           ` Guo Ren
  0 siblings, 1 reply; 8+ messages in thread
From: Alexandre Ghiti @ 2023-07-24  9:05 UTC (permalink / raw)
  To: Guo Ren
  Cc: palmer, paul.walmsley, falcon, bjorn, conor.dooley, linux-arch,
	linux-kernel, linux-riscv, Guo Ren


On 22/07/2023 01:59, Guo Ren wrote:
> On Fri, Jul 21, 2023 at 4:01 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>
>> On 21/07/2023 18:08, Guo Ren wrote:
>>> On Fri, Jul 21, 2023 at 11:19 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>>> On 21/07/2023 16:51, guoren@kernel.org wrote:
>>>>> From: Guo Ren <guoren@linux.alibaba.com>
>>>>>
>>>>> RISC-V specification permits the caching of PTEs whose V (Valid)
>>>>> bit is clear. Operating systems must be written to cope with this
>>>>> possibility, but implementers are reminded that eagerly caching
>>>>> invalid PTEs will reduce performance by causing additional page
>>>>> faults.
>>>>>
>>>>> So we must keep vmalloc_fault for the spurious page faults of kernel
>>>>> virtual address from an OoO machine.
>>>>>
>>>>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
>>>>> Signed-off-by: Guo Ren <guoren@kernel.org>
>>>>> ---
>>>>>     arch/riscv/mm/fault.c | 3 +--
>>>>>     1 file changed, 1 insertion(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
>>>>> index 85165fe438d8..f662c9eae7d4 100644
>>>>> --- a/arch/riscv/mm/fault.c
>>>>> +++ b/arch/riscv/mm/fault.c
>>>>> @@ -258,8 +258,7 @@ void handle_page_fault(struct pt_regs *regs)
>>>>>          * only copy the information from the master page table,
>>>>>          * nothing more.
>>>>>          */
>>>>> -     if ((!IS_ENABLED(CONFIG_MMU) || !IS_ENABLED(CONFIG_64BIT)) &&
>>>>> -         unlikely(addr >= VMALLOC_START && addr < VMALLOC_END)) {
>>>>> +     if (unlikely(addr >= TASK_SIZE)) {
>>>>>                 vmalloc_fault(regs, code, addr);
>>>>>                 return;
>>>>>         }
>>>> Can you share what you are trying to fix here?
>>> We met a spurious page fault panic on an OoO machine.
>>>
>>> 1. The processor speculative execution brings the V=0 entries into the
>>> TLB in the kernel virtual address.
>>> 2. Linux kernel installs the kernel virtual address with the page, and V=1
>>> 3. When kernel code access the kernel virtual address, it would raise
>>> a page fault as the V=0 entry in the tlb.
>>> 4. No vmalloc_fault, then panic.
>>>
>>>> I have a fix (that's currently running our CI) for commit 7d3332be011e
>>>> ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") that
>>>> implements flush_cache_vmap() since we lost the vmalloc_fault.
>>> Could you share that patch?
>>
>> Here we go:
>>
>>
>> Author: Alexandre Ghiti <alexghiti@rivosinc.com>
>> Date:   Fri Jul 21 08:43:44 2023 +0000
>>
>>       riscv: Implement flush_cache_vmap()
>>
>>       The RISC-V kernel needs a sfence.vma after a page table
>> modification: we
>>       used to rely on the vmalloc fault handling to emit an sfence.vma, but
>>       commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
>>       vmalloc/modules area") got rid of this path for 64-bit kernels, so
>> now we
>>       need to explicitly emit a sfence.vma in flush_cache_vmap().
>>
>>       Note that we don't need to implement flush_cache_vunmap() as the
>> generic
>>       code should emit a flush tlb after unmapping a vmalloc region.
>>
>>       Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
>> vmalloc/modules area")
>>       Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
>>
>> diff --git a/arch/riscv/include/asm/cacheflush.h
>> b/arch/riscv/include/asm/cacheflush.h
>> index 8091b8bf4883..b93ffddf8a61 100644
>> --- a/arch/riscv/include/asm/cacheflush.h
>> +++ b/arch/riscv/include/asm/cacheflush.h
>> @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
>>    #define flush_icache_user_page(vma, pg, addr, len) \
>>           flush_icache_mm(vma->vm_mm, 0)
>>
>> +#ifdef CONFIG_64BIT
>> +#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
>> +#endif
> I don't want that, and flush_tlb_kernel_range is flush_tlb_all. In
> addition, it would call IPI, which is a performance killer.


At the moment, flush_tlb_kernel_range() indeed calls flush_tlb_all() but 
that needs to be fixed, see my last patchset 
https://lore.kernel.org/linux-riscv/20230711075434.10936-1-alexghiti@rivosinc.com/.

But can you at least check that this fixes your issue? It would be 
interesting to see if the problem comes from vmalloc or something else.


> What's the problem of spurious fault replay? It only costs a
> local_tlb_flush with vaddr.


We had this exact discussion internally this week, and the fault replay 
seems like a solution. But that needs to be thought carefully: the 
vmalloc fault was removed for a reason (see Bjorn commit log), tracing 
functions can use vmalloc() in the path of the vmalloc fault, causing an 
infinite trap loop. And here you are simply re-enabling this problem. In 
addition, this patch makes vmalloc_fault() catch *all* kernel faults in 
the kernel address space, so any genuine kernel fault would loop forever 
in vmalloc_fault().


For now, the simplest solution is to implement flush_cache_vmap() 
because riscv needs a sfence.vma when adding a new mapping, and if 
that's a "performance killer", let's measure that and implement 
something like this patch is trying to do. I may be wrong, but there 
aren't many new kernel mappings that would require a call to 
flush_cache_vmap() so I disagree with the performance killer argument, 
but happy to be proven otherwise!

Thanks,

Alex


>
>> +
>>    #ifndef CONFIG_SMP
>>
>>    #define flush_icache_all() local_flush_icache_all()
>>
>>
>> Let me know if that works for you!
>>
>>
>>>
> --
> Best Regards
>   Guo Ren

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] riscv: mm: Fixup spurious fault of kernel vaddr
  2023-07-24  9:05         ` Alexandre Ghiti
@ 2023-07-29  6:40           ` Guo Ren
  2023-07-31  9:43             ` Alexandre Ghiti
  0 siblings, 1 reply; 8+ messages in thread
From: Guo Ren @ 2023-07-29  6:40 UTC (permalink / raw)
  To: Alexandre Ghiti
  Cc: palmer, paul.walmsley, falcon, bjorn, conor.dooley, linux-arch,
	linux-kernel, linux-riscv, Guo Ren

Sorry for the late reply, Alexandre. I'm busy with other suffs.

On Mon, Jul 24, 2023 at 5:05 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>
>
> On 22/07/2023 01:59, Guo Ren wrote:
> > On Fri, Jul 21, 2023 at 4:01 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
> >>
> >> On 21/07/2023 18:08, Guo Ren wrote:
> >>> On Fri, Jul 21, 2023 at 11:19 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
> >>>> On 21/07/2023 16:51, guoren@kernel.org wrote:
> >>>>> From: Guo Ren <guoren@linux.alibaba.com>
> >>>>>
> >>>>> RISC-V specification permits the caching of PTEs whose V (Valid)
> >>>>> bit is clear. Operating systems must be written to cope with this
> >>>>> possibility, but implementers are reminded that eagerly caching
> >>>>> invalid PTEs will reduce performance by causing additional page
> >>>>> faults.
> >>>>>
> >>>>> So we must keep vmalloc_fault for the spurious page faults of kernel
> >>>>> virtual address from an OoO machine.
> >>>>>
> >>>>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
> >>>>> Signed-off-by: Guo Ren <guoren@kernel.org>
> >>>>> ---
> >>>>>     arch/riscv/mm/fault.c | 3 +--
> >>>>>     1 file changed, 1 insertion(+), 2 deletions(-)
> >>>>>
> >>>>> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
> >>>>> index 85165fe438d8..f662c9eae7d4 100644
> >>>>> --- a/arch/riscv/mm/fault.c
> >>>>> +++ b/arch/riscv/mm/fault.c
> >>>>> @@ -258,8 +258,7 @@ void handle_page_fault(struct pt_regs *regs)
> >>>>>          * only copy the information from the master page table,
> >>>>>          * nothing more.
> >>>>>          */
> >>>>> -     if ((!IS_ENABLED(CONFIG_MMU) || !IS_ENABLED(CONFIG_64BIT)) &&
> >>>>> -         unlikely(addr >= VMALLOC_START && addr < VMALLOC_END)) {
> >>>>> +     if (unlikely(addr >= TASK_SIZE)) {
> >>>>>                 vmalloc_fault(regs, code, addr);
> >>>>>                 return;
> >>>>>         }
> >>>> Can you share what you are trying to fix here?
> >>> We met a spurious page fault panic on an OoO machine.
> >>>
> >>> 1. The processor speculative execution brings the V=0 entries into the
> >>> TLB in the kernel virtual address.
> >>> 2. Linux kernel installs the kernel virtual address with the page, and V=1
> >>> 3. When kernel code access the kernel virtual address, it would raise
> >>> a page fault as the V=0 entry in the tlb.
> >>> 4. No vmalloc_fault, then panic.
> >>>
> >>>> I have a fix (that's currently running our CI) for commit 7d3332be011e
> >>>> ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") that
> >>>> implements flush_cache_vmap() since we lost the vmalloc_fault.
> >>> Could you share that patch?
> >>
> >> Here we go:
> >>
> >>
> >> Author: Alexandre Ghiti <alexghiti@rivosinc.com>
> >> Date:   Fri Jul 21 08:43:44 2023 +0000
> >>
> >>       riscv: Implement flush_cache_vmap()
> >>
> >>       The RISC-V kernel needs a sfence.vma after a page table
> >> modification: we
> >>       used to rely on the vmalloc fault handling to emit an sfence.vma, but
> >>       commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> >>       vmalloc/modules area") got rid of this path for 64-bit kernels, so
> >> now we
> >>       need to explicitly emit a sfence.vma in flush_cache_vmap().
> >>
> >>       Note that we don't need to implement flush_cache_vunmap() as the
> >> generic
> >>       code should emit a flush tlb after unmapping a vmalloc region.
> >>
> >>       Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> >> vmalloc/modules area")
> >>       Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> >>
> >> diff --git a/arch/riscv/include/asm/cacheflush.h
> >> b/arch/riscv/include/asm/cacheflush.h
> >> index 8091b8bf4883..b93ffddf8a61 100644
> >> --- a/arch/riscv/include/asm/cacheflush.h
> >> +++ b/arch/riscv/include/asm/cacheflush.h
> >> @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
> >>    #define flush_icache_user_page(vma, pg, addr, len) \
> >>           flush_icache_mm(vma->vm_mm, 0)
> >>
> >> +#ifdef CONFIG_64BIT
> >> +#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
> >> +#endif
> > I don't want that, and flush_tlb_kernel_range is flush_tlb_all. In
> > addition, it would call IPI, which is a performance killer.
>
>
> At the moment, flush_tlb_kernel_range() indeed calls flush_tlb_all() but
> that needs to be fixed, see my last patchset
> https://lore.kernel.org/linux-riscv/20230711075434.10936-1-alexghiti@rivosinc.com/.
>
> But can you at least check that this fixes your issue? It would be
> interesting to see if the problem comes from vmalloc or something else.
It could solve my issue.

>
>
> > What's the problem of spurious fault replay? It only costs a
> > local_tlb_flush with vaddr.
>
>
> We had this exact discussion internally this week, and the fault replay
> seems like a solution. But that needs to be thought carefully: the
> vmalloc fault was removed for a reason (see Bjorn commit log), tracing
> functions can use vmalloc() in the path of the vmalloc fault, causing an
> infinite trap loop. And here you are simply re-enabling this problem.
Thx for mentioning it, and I will solve it in the next version of the patch:

-static inline void vmalloc_fault(struct pt_regs *regs, int code,
unsigned long addr)
+static void notrace vmalloc_fault(struct pt_regs *regs, int code,
unsigned long addr)

> In
> addition, this patch makes vmalloc_fault() catch *all* kernel faults in
> the kernel address space, so any genuine kernel fault would loop forever
> in vmalloc_fault().
We check whether kernel vaddr is valid by the page_table, not range.
I'm sure "the any genuine kernel fault would loop forever in
vmalloc_fault()" is about what? Could you give an example?

>
>
> For now, the simplest solution is to implement flush_cache_vmap()
> because riscv needs a sfence.vma when adding a new mapping, and if
It's not a local_tlb_flush, and it would ipi broadcast all harts.
on_each_cpu(__ipi_flush_tlb_all, NULL, 1);

That's too horrible.

Some custom drivers or test codes may care about it.

> that's a "performance killer", let's measure that and implement
> something like this patch is trying to do. I may be wrong, but there
> aren't many new kernel mappings that would require a call to
> flush_cache_vmap() so I disagree with the performance killer argument,
> but happy to be proven otherwise!

1. I agree to pre-allocate pgd entries. It's good for performance, but
don't do that when Sv32.
2. We still need vmalloc_fault to match ISA spec requirements. (Some
vendors' microarchitectures (e.g., T-HEAD c910) could prevent V=0 into
TLB when PTW, then they don't need it.)
3. Only when vmalloc_fault can't solve the problem, then let's think
about the flush_cache_vmap() solution.

>
> Thanks,
>
> Alex
>
>
> >
> >> +
> >>    #ifndef CONFIG_SMP
> >>
> >>    #define flush_icache_all() local_flush_icache_all()
> >>
> >>
> >> Let me know if that works for you!
> >>
> >>
> >>>
> > --
> > Best Regards
> >   Guo Ren



--
Best Regards
 Guo Ren

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] riscv: mm: Fixup spurious fault of kernel vaddr
  2023-07-29  6:40           ` Guo Ren
@ 2023-07-31  9:43             ` Alexandre Ghiti
  0 siblings, 0 replies; 8+ messages in thread
From: Alexandre Ghiti @ 2023-07-31  9:43 UTC (permalink / raw)
  To: Guo Ren
  Cc: palmer, paul.walmsley, falcon, bjorn, conor.dooley, linux-arch,
	linux-kernel, linux-riscv, Guo Ren


On 29/07/2023 08:40, Guo Ren wrote:
> Sorry for the late reply, Alexandre. I'm busy with other suffs.
>
> On Mon, Jul 24, 2023 at 5:05 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>
>> On 22/07/2023 01:59, Guo Ren wrote:
>>> On Fri, Jul 21, 2023 at 4:01 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>>> On 21/07/2023 18:08, Guo Ren wrote:
>>>>> On Fri, Jul 21, 2023 at 11:19 PM Alexandre Ghiti <alex@ghiti.fr> wrote:
>>>>>> On 21/07/2023 16:51, guoren@kernel.org wrote:
>>>>>>> From: Guo Ren <guoren@linux.alibaba.com>
>>>>>>>
>>>>>>> RISC-V specification permits the caching of PTEs whose V (Valid)
>>>>>>> bit is clear. Operating systems must be written to cope with this
>>>>>>> possibility, but implementers are reminded that eagerly caching
>>>>>>> invalid PTEs will reduce performance by causing additional page
>>>>>>> faults.
>>>>>>>
>>>>>>> So we must keep vmalloc_fault for the spurious page faults of kernel
>>>>>>> virtual address from an OoO machine.
>>>>>>>
>>>>>>> Signed-off-by: Guo Ren <guoren@linux.alibaba.com>
>>>>>>> Signed-off-by: Guo Ren <guoren@kernel.org>
>>>>>>> ---
>>>>>>>      arch/riscv/mm/fault.c | 3 +--
>>>>>>>      1 file changed, 1 insertion(+), 2 deletions(-)
>>>>>>>
>>>>>>> diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
>>>>>>> index 85165fe438d8..f662c9eae7d4 100644
>>>>>>> --- a/arch/riscv/mm/fault.c
>>>>>>> +++ b/arch/riscv/mm/fault.c
>>>>>>> @@ -258,8 +258,7 @@ void handle_page_fault(struct pt_regs *regs)
>>>>>>>           * only copy the information from the master page table,
>>>>>>>           * nothing more.
>>>>>>>           */
>>>>>>> -     if ((!IS_ENABLED(CONFIG_MMU) || !IS_ENABLED(CONFIG_64BIT)) &&
>>>>>>> -         unlikely(addr >= VMALLOC_START && addr < VMALLOC_END)) {
>>>>>>> +     if (unlikely(addr >= TASK_SIZE)) {
>>>>>>>                  vmalloc_fault(regs, code, addr);
>>>>>>>                  return;
>>>>>>>          }
>>>>>> Can you share what you are trying to fix here?
>>>>> We met a spurious page fault panic on an OoO machine.
>>>>>
>>>>> 1. The processor speculative execution brings the V=0 entries into the
>>>>> TLB in the kernel virtual address.
>>>>> 2. Linux kernel installs the kernel virtual address with the page, and V=1
>>>>> 3. When kernel code access the kernel virtual address, it would raise
>>>>> a page fault as the V=0 entry in the tlb.
>>>>> 4. No vmalloc_fault, then panic.
>>>>>
>>>>>> I have a fix (that's currently running our CI) for commit 7d3332be011e
>>>>>> ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") that
>>>>>> implements flush_cache_vmap() since we lost the vmalloc_fault.
>>>>> Could you share that patch?
>>>> Here we go:
>>>>
>>>>
>>>> Author: Alexandre Ghiti <alexghiti@rivosinc.com>
>>>> Date:   Fri Jul 21 08:43:44 2023 +0000
>>>>
>>>>        riscv: Implement flush_cache_vmap()
>>>>
>>>>        The RISC-V kernel needs a sfence.vma after a page table
>>>> modification: we
>>>>        used to rely on the vmalloc fault handling to emit an sfence.vma, but
>>>>        commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
>>>>        vmalloc/modules area") got rid of this path for 64-bit kernels, so
>>>> now we
>>>>        need to explicitly emit a sfence.vma in flush_cache_vmap().
>>>>
>>>>        Note that we don't need to implement flush_cache_vunmap() as the
>>>> generic
>>>>        code should emit a flush tlb after unmapping a vmalloc region.
>>>>
>>>>        Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
>>>> vmalloc/modules area")
>>>>        Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
>>>>
>>>> diff --git a/arch/riscv/include/asm/cacheflush.h
>>>> b/arch/riscv/include/asm/cacheflush.h
>>>> index 8091b8bf4883..b93ffddf8a61 100644
>>>> --- a/arch/riscv/include/asm/cacheflush.h
>>>> +++ b/arch/riscv/include/asm/cacheflush.h
>>>> @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
>>>>     #define flush_icache_user_page(vma, pg, addr, len) \
>>>>            flush_icache_mm(vma->vm_mm, 0)
>>>>
>>>> +#ifdef CONFIG_64BIT
>>>> +#define flush_cache_vmap(start, end) flush_tlb_kernel_range(start, end)
>>>> +#endif
>>> I don't want that, and flush_tlb_kernel_range is flush_tlb_all. In
>>> addition, it would call IPI, which is a performance killer.
>>
>> At the moment, flush_tlb_kernel_range() indeed calls flush_tlb_all() but
>> that needs to be fixed, see my last patchset
>> https://lore.kernel.org/linux-riscv/20230711075434.10936-1-alexghiti@rivosinc.com/.
>>
>> But can you at least check that this fixes your issue? It would be
>> interesting to see if the problem comes from vmalloc or something else.
> It could solve my issue.
>
>>
>>> What's the problem of spurious fault replay? It only costs a
>>> local_tlb_flush with vaddr.
>>
>> We had this exact discussion internally this week, and the fault replay
>> seems like a solution. But that needs to be thought carefully: the
>> vmalloc fault was removed for a reason (see Bjorn commit log), tracing
>> functions can use vmalloc() in the path of the vmalloc fault, causing an
>> infinite trap loop. And here you are simply re-enabling this problem.
> Thx for mentioning it, and I will solve it in the next version of the patch:
>
> -static inline void vmalloc_fault(struct pt_regs *regs, int code,
> unsigned long addr)
> +static void notrace vmalloc_fault(struct pt_regs *regs, int code,
> unsigned long addr)


That does not solve the problem, since it should be any function called 
from the trap entry to vmalloc_fault() that needs to be marked a 
notrace. But it's not only tracing, it's all code that could vmalloc() 
something in the page fault path before reaching vmalloc_fault().

I have been rethinking about that,  it is very unlikely but the problem 
could happen, even if the microarchitecture does not cache invalid 
entries (but it's way more likely to happen when invalid entries are 
cached in the TLB)...And the only way I can think of to get rid of it is 
a preventive sfence.vma.


>
>> In
>> addition, this patch makes vmalloc_fault() catch *all* kernel faults in
>> the kernel address space, so any genuine kernel fault would loop forever
>> in vmalloc_fault().
> We check whether kernel vaddr is valid by the page_table, not range.
> I'm sure "the any genuine kernel fault would loop forever in
> vmalloc_fault()" is about what? Could you give an example?


In the patch you propose, vmalloc_fault() would intercept all faults 
triggered on kernel addresses, even protection traps. vmalloc_fault() 
only checks the presence of the faulting entry in the kernel page table, 
not its permissions. So any protection trap on kernel addresses would 
loop forever in vmalloc_fault().


>
>>
>> For now, the simplest solution is to implement flush_cache_vmap()
>> because riscv needs a sfence.vma when adding a new mapping, and if
> It's not a local_tlb_flush, and it would ipi broadcast all harts.
> on_each_cpu(__ipi_flush_tlb_all, NULL, 1);
>
> That's too horrible.
>
> Some custom drivers or test codes may care about it.


Yes, and there is vmap stack too...


>
>> that's a "performance killer", let's measure that and implement
>> something like this patch is trying to do. I may be wrong, but there
>> aren't many new kernel mappings that would require a call to
>> flush_cache_vmap() so I disagree with the performance killer argument,
>> but happy to be proven otherwise!
> 1. I agree to pre-allocate pgd entries. It's good for performance, but
> don't do that when Sv32.


Nothing changes for sv32, vmalloc_fault() is still there (but to me, 
there is still the problem with vmalloc() done before reaching 
vmalloc_fault()).


> 2. We still need vmalloc_fault to match ISA spec requirements. (Some
> vendors' microarchitectures (e.g., T-HEAD c910) could prevent V=0 into
> TLB when PTW, then they don't need it.)


I disagree, even if invalid entries are not cached in the TLB, there 
still exists a small window where we could take a trap on a newly 
created vmalloc mapping which would require vmalloc_fault(). Or to 
prevent that, we need flush_cache_vmap().


> 3. Only when vmalloc_fault can't solve the problem, then let's think
> about the flush_cache_vmap() solution.


To me, vmalloc_fault() can't solve the problem, even when invalid 
entries are not cached in the TLB. So I agree with you that 
flush_cache_vmap() is not that great, but I don't see any other 
straightforward solution here. Maybe a very early sfence.vma in the page 
fault path, before anything gets called, but we need a solution for 6.5.


>
>> Thanks,
>>
>> Alex
>>
>>
>>>> +
>>>>     #ifndef CONFIG_SMP
>>>>
>>>>     #define flush_icache_all() local_flush_icache_all()
>>>>
>>>>
>>>> Let me know if that works for you!
>>>>
>>>>
>>> --
>>> Best Regards
>>>    Guo Ren
>
>
> --
> Best Regards
>   Guo Ren

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2023-07-31  9:43 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-07-21 14:51 [PATCH] riscv: mm: Fixup spurious fault of kernel vaddr guoren
2023-07-21 15:18 ` Alexandre Ghiti
2023-07-21 16:08   ` Guo Ren
2023-07-21 20:00     ` Alexandre Ghiti
2023-07-21 23:59       ` Guo Ren
2023-07-24  9:05         ` Alexandre Ghiti
2023-07-29  6:40           ` Guo Ren
2023-07-31  9:43             ` Alexandre Ghiti

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox