Linux-RISC-V Archive on lore.kernel.org
* About the Use of sfence.vma in Kernel
@ 2018-11-01  9:00 Alan Kao
  2018-11-01  9:00 ` Alan Kao
  2018-11-05  0:49 ` [sw-dev] " Alan Kao
  0 siblings, 2 replies; 14+ messages in thread
From: Alan Kao @ 2018-11-01  9:00 UTC (permalink / raw)
  To: linux-riscv

Hi all,

As the Privileged Spec says about the sfence.vma instruction:

> The supervisor memory-management fence instruction SFENCE.VMA is used
> to synchronize updates to in-memory memory-management data structures 
> with current execution.  Instruction execution causes implicit reads 
> and writes to these data structures;  however, these implicit references
> are ordinarily not ordered with respect to loads and stores in the instruction
> stream.
>
> Executing an SFENCE.VMA instruction guarantees that any stores in the
> instruction stream prior to the SFENCE.VMA are ordered before all implicit
> references subsequent to the SFENCE.VMA.

It naturally follows that we should issue an sfence.vma whenever the page
table is modified.  There are several examples of this in the kernel already,
such as

alloc_set_pte (in mm/memory.c):
...
        set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
        /* no need to invalidate: a not-present page won't be cached */
        update_mmu_cache(vma, vmf->address, vmf->pte);
...
where the update_mmu_cache function eventually issues a sfence.vma.

I was curious whether this is always the case, so I did some research.  RV64
uses a three-level page table (pud, pmd and pte), so I traced the code flow
after set_pud, set_pmd and set_pte.

It turns out that some of the calls to them are not followed by an
sfence.vma.  For instance, in the vmalloc_fault path in do_page_fault,
there is no sfence.vma, nor any call that issues one, after set_pgd, which
forwards to set_pud later.

Are these bugs, or do I just misunderstand the instruction?  As the kernel
has already been stable for quite a while now, this is not likely to be a
critical bug.

Any clarification will be highly appreciated.

Many thanks,
Alan Kao

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [sw-dev] About the Use of sfence.vma in Kernel
  2018-11-01  9:00 About the Use of sfence.vma in Kernel Alan Kao
  2018-11-01  9:00 ` Alan Kao
@ 2018-11-05  0:49 ` Alan Kao
  2018-11-05  0:49   ` Alan Kao
  2018-11-06  2:33   ` Palmer Dabbelt
  1 sibling, 2 replies; 14+ messages in thread
From: Alan Kao @ 2018-11-05  0:49 UTC (permalink / raw)
  To: linux-riscv

Hi Palmer,

I believe the code in arch/riscv/mm/fault.c is mostly from you.
Do you have any comments on this?

On Thu, Nov 01, 2018 at 05:00:15PM +0800, Alan Kao wrote:
> Hi all,
> 
> As mentioned in the Privileged Spec about sfence.vma instruction:
> 
> > The supervisor memory-management fence instruction SFENCE.VMA is used
> > to synchronize updates to in-memory memory-management data structures 
> > with current execution.  Instruction execution causes implicit reads 
> > and writes to these data structures;  however, these implicit references
> > are ordinarily not ordered with respect to loads and stores in the instruction
> > stream.
> >
> > Executing an SFENCE.VMA instruction guarantees that any stores in the
> > instruction stream prior to the SFENCE.VMA are ordered before all implicit
> > references subsequent to the SFENCE.VMA.
> 
> It naturally follows that we should use sfence.vma once the page table is
> modified.  There are several examples in the kernel already, such as
> 
> alloc_set_pte (in mm/memory.c):
> ...
>         set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>         /* no need to invalidate: a not-present page won't be cached */
>         update_mmu_cache(vma, vmf->address, vmf->pte);
> ...
> where the update_mmu_cache function eventually issues a sfence.vma.
> 
> I was interested if it is always the case and did some research.  RV64 uses
> 3-level of page table entry, pud, pmd and pte, so I traced a little bit about
> the code flow after set_pud, set_pmd and set_pte.
> 
> It turns out that some of the calls to them are not followed by a
> sfence.vma.  For an instance, in the vmalloc_fault region in do_page_fault,
> there is no sfence.vma or calls to it after set_pgd, which directs to set_pud
> later.
> 
> Are they bugs or I just misunderstand the instruction?  As the kernel has
> already been stable for quite a while now, it is not likely to be a critical
> bug.
> 
> Any clarification will highly appreciated.
> 
> Many thanks,
> Alan Kao
> 
> -- 
> You received this message because you are subscribed to the Google Groups "RISC-V SW Dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to sw-dev+unsubscribe at groups.riscv.org.
> To post to this group, send email to sw-dev at groups.riscv.org.
> Visit this group at https://groups.google.com/a/groups.riscv.org/group/sw-dev/.
> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/sw-dev/20181101090015.GA6997%40andestech.com.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [sw-dev] About the Use of sfence.vma in Kernel
  2018-11-05  0:49 ` [sw-dev] " Alan Kao
  2018-11-05  0:49   ` Alan Kao
@ 2018-11-06  2:33   ` Palmer Dabbelt
  2018-11-06  2:33     ` Palmer Dabbelt
                       ` (3 more replies)
  1 sibling, 4 replies; 14+ messages in thread
From: Palmer Dabbelt @ 2018-11-06  2:33 UTC (permalink / raw)
  To: linux-riscv

Sorry, I missed your original email.

On Sun, 04 Nov 2018 16:49:29 PST (-0800), alankao at andestech.com wrote:
> Hi Palmer,
>
> I believe the code in arch/riscv/mm/fault.c is mostly from you.
> Do you have any comments on this?
>
> On Thu, Nov 01, 2018 at 05:00:15PM +0800, Alan Kao wrote:
>> Hi all,
>>
>> As mentioned in the Privileged Spec about sfence.vma instruction:
>>
>> > The supervisor memory-management fence instruction SFENCE.VMA is used
>> > to synchronize updates to in-memory memory-management data structures
>> > with current execution.  Instruction execution causes implicit reads
>> > and writes to these data structures;  however, these implicit references
>> > are ordinarily not ordered with respect to loads and stores in the instruction
>> > stream.
>> >
>> > Executing an SFENCE.VMA instruction guarantees that any stores in the
>> > instruction stream prior to the SFENCE.VMA are ordered before all implicit
>> > references subsequent to the SFENCE.VMA.
>>
>> It naturally follows that we should use sfence.vma once the page table is
>> modified.  There are several examples in the kernel already, such as
>>
>> alloc_set_pte (in mm/memory.c):
>> ...
>>         set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>>         /* no need to invalidate: a not-present page won't be cached */
>>         update_mmu_cache(vma, vmf->address, vmf->pte);
>> ...
>> where the update_mmu_cache function eventually issues a sfence.vma.
>>
>> I was interested if it is always the case and did some research.  RV64 uses
>> 3-level of page table entry, pud, pmd and pte, so I traced a little bit about
>> the code flow after set_pud, set_pmd and set_pte.
>>
>> It turns out that some of the calls to them are not followed by a
>> sfence.vma.  For an instance, in the vmalloc_fault region in do_page_fault,
>> there is no sfence.vma or calls to it after set_pgd, which directs to set_pud
>> later.

This specific one looks like a bug: we're trying to fill out the page table for 
the vmalloc region, but we'll just continue trapping without an "sfence.vma".  
The path between poking the page tables and the sret is pretty short and 
doesn't appear to ever have an "sfence.vma", so I'm not sure how this could 
work.
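
If the missing fence is the culprit, one possible shape of a fix -- only a
sketch, untested, and not necessarily what would land upstream -- would be to
flush the faulting address on this hart before returning from the vmalloc
branch, e.g. via local_flush_tlb_page() from asm/tlbflush.h, which expands to
an "sfence.vma" with the address as operand:

```diff
diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ asmlinkage void do_page_fault(struct pt_regs *regs)
 		pte_k = pte_offset_kernel(pmd_k, addr);
 		if (!pte_present(*pte_k))
 			goto no_context;
+
+		/* Sketch only: make the new kernel mapping visible to
+		 * this hart's translation hardware before retrying. */
+		local_flush_tlb_page(addr);
 		return;
 	}
```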

>>
>> Are they bugs or I just misunderstand the instruction?  As the kernel has
>> already been stable for quite a while now, it is not likely to be a critical
>> bug.
>>
>> Any clarification will highly appreciated.

Well, certainly from this it looks pretty broken -- and in a manner I'd expect 
to trigger frequently.  There are no fences in any of the other similar-looking 
implementations.

Maybe I'm missing something here?

FWIW, if I apply the following diff

    diff --git a/arch/riscv/kernel/reset.c b/arch/riscv/kernel/reset.c
    index 2a53d26ffdd6..fbd132d388fb 100644
    --- a/arch/riscv/kernel/reset.c
    +++ b/arch/riscv/kernel/reset.c
    @@ -15,6 +15,8 @@
     #include <linux/export.h>
     #include <asm/sbi.h>
    
    +extern long vmalloc_faults;
    +
     void (*pm_power_off)(void) = machine_power_off;
     EXPORT_SYMBOL(pm_power_off);
    
    @@ -31,6 +33,7 @@ void machine_halt(void)
    
     void machine_power_off(void)
     {
    +	printk("vmalloc faults: %ld\n", vmalloc_faults);
     	sbi_shutdown();
     	while (1);
     }
    diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
    index 88401d5125bc..61ef1128632c 100644
    --- a/arch/riscv/mm/fault.c
    +++ b/arch/riscv/mm/fault.c
    @@ -30,6 +30,8 @@
     #include <asm/pgalloc.h>
     #include <asm/ptrace.h>
    
    +long vmalloc_faults = 0;
    +
     /*
      * This routine handles page faults.  It determines the address and the
      * problem, and then passes it off to one of the appropriate routines.
    @@ -281,6 +283,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
     		pte_k = pte_offset_kernel(pmd_k, addr);
     		if (!pte_present(*pte_k))
     			goto no_context;
    +
    +		vmalloc_faults++;
     		return;
     	}
     }

I get only a single vmalloc fault when doing a boot+shutdown of Fedora in QEMU, 
so maybe this just slipped through the cracks?

>>
>> Many thanks,
>> Alan Kao
>>
>> --
>> You received this message because you are subscribed to the Google Groups "RISC-V SW Dev" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to sw-dev+unsubscribe at groups.riscv.org.
>> To post to this group, send email to sw-dev at groups.riscv.org.
>> Visit this group at https://groups.google.com/a/groups.riscv.org/group/sw-dev/.
>> To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/sw-dev/20181101090015.GA6997%40andestech.com.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [sw-dev] About the Use of sfence.vma in Kernel
  2018-11-06  2:33   ` Palmer Dabbelt
  2018-11-06  2:33     ` Palmer Dabbelt
@ 2018-11-06  7:49     ` Alan Kao
  2018-11-06  7:49       ` Alan Kao
  2018-11-06  8:03     ` Andreas Schwab
  2018-11-06 10:46     ` Nick Kossifidis
  3 siblings, 1 reply; 14+ messages in thread
From: Alan Kao @ 2018-11-06  7:49 UTC (permalink / raw)
  To: linux-riscv

Thanks for the response!

On Mon, Nov 05, 2018 at 06:33:23PM -0800, Palmer Dabbelt wrote:
> Sorry, I missedy our original email.
> 
> On Sun, 04 Nov 2018 16:49:29 PST (-0800), alankao at andestech.com wrote:
> >Hi Palmer,
> >
> >I believe the code in arch/riscv/mm/fault.c is mostly from you.
> >Do you have any comments on this?
> >
> >On Thu, Nov 01, 2018 at 05:00:15PM +0800, Alan Kao wrote:
> >>Hi all,
> >>
> >>As mentioned in the Privileged Spec about sfence.vma instruction:
> >>
> >>> The supervisor memory-management fence instruction SFENCE.VMA is used
> >>> to synchronize updates to in-memory memory-management data structures
> >>> with current execution.  Instruction execution causes implicit reads
> >>> and writes to these data structures;  however, these implicit references
> >>> are ordinarily not ordered with respect to loads and stores in the instruction
> >>> stream.
> >>>
> >>> Executing an SFENCE.VMA instruction guarantees that any stores in the
> >>> instruction stream prior to the SFENCE.VMA are ordered before all implicit
> >>> references subsequent to the SFENCE.VMA.
> >>
> >>It naturally follows that we should use sfence.vma once the page table is
> >>modified.  There are several examples in the kernel already, such as
> >>
> >>alloc_set_pte (in mm/memory.c):
> >>...
> >>        set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> >>        /* no need to invalidate: a not-present page won't be cached */
> >>        update_mmu_cache(vma, vmf->address, vmf->pte);
> >>...
> >>where the update_mmu_cache function eventually issues a sfence.vma.
> >>
> >>I was interested if it is always the case and did some research.  RV64 uses
> >>3-level of page table entry, pud, pmd and pte, so I traced a little bit about
> >>the code flow after set_pud, set_pmd and set_pte.
> >>
> >>It turns out that some of the calls to them are not followed by a
> >>sfence.vma.  For an instance, in the vmalloc_fault region in do_page_fault,
> >>there is no sfence.vma or calls to it after set_pgd, which directs to set_pud
> >>later.
> 
> This specific one looks like a bug: we're trying to fill out the page table
> for the vmalloc region, but we'll just continue trapping without an
> "sfence.vma".  The path between poking the page tables and the sret is
> pretty short and doesn't appear to ever have an "sfence.vma", so I'm not
> sure how this could work.

I had some discussions with our hardware engineers and architects about this.
Their hypothesis is that the translation hardware on your board may somehow
snoop the dcache, so that an update like this seamlessly takes effect in any
subsequent VA-to-PA translation.  Would you like to help verify this?

The riscv_virt board on QEMU doesn't count here, because translation is not
modeled.

> >>
> >>Are they bugs or I just misunderstand the instruction?  As the kernel has
> >>already been stable for quite a while now, it is not likely to be a critical
> >>bug.
> >>
> >>Any clarification will highly appreciated.
> 
> Well, certainly from this it looks pretty broken -- and in a manner I'd
> expect to trigger frequently.  There are no fences in any of the other
> similar-looking implementations.
> 
> Maybe I'm missing something here?

Actually, this set_pud is just one instance of the problem.

As mentioned in the previous mail, I traced the current codebase to see which
functions call set_pte, set_pmd, or set_pud.  Under my compiler optimization
settings and environment, I found the following functions that contain an
inlined set_p** with no obvious sfence.vma following, just for your
information:

PUD cases:
__pmd_alloc
pud_clear_bad

PMD cases:
pmd_clear_bad
__pte_alloc_kernel

Both PUD and PMD:
free_pgd_range
do_page_fault

PTE cases:
unmap_page_range
remap_pfn_range
copy_page_range
vm_insert_page
change_protection_range
page_mkclean_one
try_to_unmap_one
vmap_page_range_noflush
madvise_free_pte_range
remove_migration_pte
ioremap_page_range

Some also fall into a grey area.  For example, finish_mkwrite_fault
compiles to an instruction sequence like

> 	sd	s3,0(s1)       // *ptep = pte
> 	ld	a5,24(s2)
> 	sfence.vma	a5

in which the sfence.vma cannot immediately follow the PTE update, because we
first load the PTE value into a5, which stands for the leaf PTE.

> 
> FWIW, if I apply the following diff
> 
>    diff --git a/arch/riscv/kernel/reset.c b/arch/riscv/kernel/reset.c
>    index 2a53d26ffdd6..fbd132d388fb 100644
>    --- a/arch/riscv/kernel/reset.c
>    +++ b/arch/riscv/kernel/reset.c
>    @@ -15,6 +15,8 @@
>     #include <linux/export.h>
>     #include <asm/sbi.h>
>    +extern long vmalloc_faults;
>    +
>     void (*pm_power_off)(void) = machine_power_off;
>     EXPORT_SYMBOL(pm_power_off);
>    @@ -31,6 +33,7 @@ void machine_halt(void)
>     void machine_power_off(void)
>     {
>    +	printk("vmalloc faults: %ld\n", vmalloc_faults);
>     	sbi_shutdown();
>     	while (1);
>     }
>    diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
>    index 88401d5125bc..61ef1128632c 100644
>    --- a/arch/riscv/mm/fault.c
>    +++ b/arch/riscv/mm/fault.c
>    @@ -30,6 +30,8 @@
>     #include <asm/pgalloc.h>
>     #include <asm/ptrace.h>
>    +long vmalloc_faults = 0;
>    +
>     /*
>      * This routine handles page faults.  It determines the address and the
>      * problem, and then passes it off to one of the appropriate routines.
>    @@ -281,6 +283,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
>     		pte_k = pte_offset_kernel(pmd_k, addr);
>     		if (!pte_present(*pte_k))
>     			goto no_context;
>    +
>    +		vmalloc_faults++;
>     		return;
>     	}
>     }
> 
> I get only a single vmalloc fault when doing a boot+shutdown of Fedora in
> QEMU, so maybe this just slipped through the cracks?

Thanks for the experiment, but since there are many other cases, as listed
above, maybe the fastest way to figure this mystery out is to check the
details with your hardware people?  IMHO the hypothesis sounds plausible.
Once that is settled, we can easily figure out what to do about these
page-table updates.

Alan

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [sw-dev] About the Use of sfence.vma in Kernel
  2018-11-06  7:49     ` Alan Kao
@ 2018-11-06  7:49       ` Alan Kao
  0 siblings, 0 replies; 14+ messages in thread
From: Alan Kao @ 2018-11-06  7:49 UTC (permalink / raw)
  To: Palmer Dabbelt; +Cc: linux-riscv, sw-dev, greentime

Thanks for the response!

On Mon, Nov 05, 2018 at 06:33:23PM -0800, Palmer Dabbelt wrote:
> Sorry, I missedy our original email.
> 
> On Sun, 04 Nov 2018 16:49:29 PST (-0800), alankao@andestech.com wrote:
> >Hi Palmer,
> >
> >I believe the code in arch/riscv/mm/fault.c is mostly from you.
> >Do you have any comments on this?
> >
> >On Thu, Nov 01, 2018 at 05:00:15PM +0800, Alan Kao wrote:
> >>Hi all,
> >>
> >>As mentioned in the Privileged Spec about sfence.vma instruction:
> >>
> >>> The supervisor memory-management fence instruction SFENCE.VMA is used
> >>> to synchronize updates to in-memory memory-management data structures
> >>> with current execution.  Instruction execution causes implicit reads
> >>> and writes to these data structures;  however, these implicit references
> >>> are ordinarily not ordered with respect to loads and stores in the instruction
> >>> stream.
> >>>
> >>> Executing an SFENCE.VMA instruction guarantees that any stores in the
> >>> instruction stream prior to the SFENCE.VMA are ordered before all implicit
> >>> references subsequent to the SFENCE.VMA.
> >>
> >>It naturally follows that we should use sfence.vma once the page table is
> >>modified.  There are several examples in the kernel already, such as
> >>
> >>alloc_set_pte (in mm/memory.c):
> >>...
> >>        set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
> >>        /* no need to invalidate: a not-present page won't be cached */
> >>        update_mmu_cache(vma, vmf->address, vmf->pte);
> >>...
> >>where the update_mmu_cache function eventually issues a sfence.vma.
> >>
> >>I was interested if it is always the case and did some research.  RV64 uses
> >>3-level of page table entry, pud, pmd and pte, so I traced a little bit about
> >>the code flow after set_pud, set_pmd and set_pte.
> >>
> >>It turns out that some of the calls to them are not followed by a
> >>sfence.vma.  For an instance, in the vmalloc_fault region in do_page_fault,
> >>there is no sfence.vma or calls to it after set_pgd, which directs to set_pud
> >>later.
> 
> This specific one looks like a bug: we're trying to fill out the page table
> for the vmalloc region, but we'll just continue trapping without an
> "sfence.vma".  The path between poking the page tables and the sret is
> pretty short and doesn't appear to ever have an "sfence.vma", so I'm not
> sure how this could work.

I have some discussion with our hardware guys and architects previously.
Their hypothesis is that, the translation hardware on your board could somehow
snoop dcache, so that a update like this seamlessly work out any subsequent
VA-to-PA process.  Would you like to help verify this?  

riscv_virt board on QEMU doesn't matter because translation is not modeled.

> >>
> >>Are they bugs or I just misunderstand the instruction?  As the kernel has
> >>already been stable for quite a while now, it is not likely to be a critical
> >>bug.
> >>
> >>Any clarification will highly appreciated.
> 
> Well, certainly from this it looks pretty broken -- and in a manner I'd
> expect to trigger frequently.  There are no fences in any of the other
> similar-looking implementations.
> 
> Maybe I'm missing something here?

Actually, this set_pud is just one instance of this problem.

As mentioned in the previous mail, I've traced the current codebase to see which
functions call either set_pte, set_pmd, or set_pud.  Under my compiler 
optimization setting and environment, I found the following functions containing 
inlined et_p** but no obvious sfence.vma follows, just for your information:

PUD cases:
__pmd_alloc
pud_clear_bad

PMD cases:
pmd_clear_bad
__pte_alloc_kernel

Both PUD and PMD:
free_pgd_range
do_page_fault

PTE cases:
unmap_page_range
remap_pfn_range
copy_page_range
vm_insert_page
change_protection_range
page_mkclean_one
try_to_unmap_one
vmap_page_range_noflush
madvise_free_pte_range
remove_migration_pte
ioremap_page_range

Some also falls in the grey area.  For example, finish_mkwrite_fault
has an instruction sequence like

> 	sd	s3,0(s1)       // *ptep = pte
> 	ld	a5,24(s2)
> sfence.vma	a5

in which sfence.vma cannot follow by the PTE update immediately because
we would like to load the PTE value into $a5, which stands for the leaf PTE.

> 
> FWIW, if I apply the following diff
> 
>    diff --git a/arch/riscv/kernel/reset.c b/arch/riscv/kernel/reset.c
>    index 2a53d26ffdd6..fbd132d388fb 100644
>    --- a/arch/riscv/kernel/reset.c
>    +++ b/arch/riscv/kernel/reset.c
>    @@ -15,6 +15,8 @@
>     #include <linux/export.h>
>     #include <asm/sbi.h>
>    +extern long vmalloc_faults;
>    +
>     void (*pm_power_off)(void) = machine_power_off;
>     EXPORT_SYMBOL(pm_power_off);
>    @@ -31,6 +33,7 @@ void machine_halt(void)
>     void machine_power_off(void)
>     {
>    +	printk("vmalloc faults: %ld\n", vmalloc_faults);
>     	sbi_shutdown();
>     	while (1);
>     }
>    diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
>    index 88401d5125bc..61ef1128632c 100644
>    --- a/arch/riscv/mm/fault.c
>    +++ b/arch/riscv/mm/fault.c
>    @@ -30,6 +30,8 @@
>     #include <asm/pgalloc.h>
>     #include <asm/ptrace.h>
>    +long vmalloc_faults = 0;
>    +
>     /*
>      * This routine handles page faults.  It determines the address and the
>      * problem, and then passes it off to one of the appropriate routines.
>    @@ -281,6 +283,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs)
>     		pte_k = pte_offset_kernel(pmd_k, addr);
>     		if (!pte_present(*pte_k))
>     			goto no_context;
>    +
>    +		vmalloc_faults++;
>     		return;
>     	}
>     }
> 
> I get only a single vmalloc fault when doing a boot+shutdown of Fedora in
> QEMU, so maybe this just slipped through the cracks?

Thanks for the experiment, but as there are many other cases listed above,
maybe the fastest way to figure this mystery out is to check the details
with your hardware people?  IMHO the hypothesis sounds plausible.  Once that
is settled, we can figure out with ease what to do about these page table
updates.

Alan

_______________________________________________
linux-riscv mailing list
linux-riscv@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-riscv

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [sw-dev] About the Use of sfence.vma in Kernel
  2018-11-06  2:33   ` Palmer Dabbelt
  2018-11-06  2:33     ` Palmer Dabbelt
  2018-11-06  7:49     ` Alan Kao
@ 2018-11-06  8:03     ` Andreas Schwab
  2018-11-06  8:03       ` Andreas Schwab
  2018-11-07 15:51       ` Palmer Dabbelt
  2018-11-06 10:46     ` Nick Kossifidis
  3 siblings, 2 replies; 14+ messages in thread
From: Andreas Schwab @ 2018-11-06  8:03 UTC (permalink / raw)
  To: linux-riscv

Perhaps that's the reason I sometimes get errors like this:

[303420.500000] swap_info_get: Bad swap file entry 2000000000df09b1
[303420.500000] BUG: Bad page map in process struct-ret-3.ex  pte:6f84d8c0 pmd:8f579001
[303420.510000] addr:000000008a1610cd vm_flags:00100173 anon_vma:0000000036036fc9 mapping:          (null) index:3ffffe7
[303420.520000] file:          (null) fault:          (null) mmap:          (null) readpage:          (null)
[303420.530000] CPU: 1 PID: 2054 Comm: struct-ret-3.ex Not tainted 4.19.0-00040-g2d5ee99e76 #42
[303420.530000] Call Trace:
[303420.530000] [<ffffffe000c847d4>] walk_stackframe+0x0/0xa4
[303420.530000] [<ffffffe000c849d4>] show_stack+0x2a/0x34
[303420.530000] [<ffffffe0011a6800>] dump_stack+0x62/0x7c
[303420.530000] [<ffffffe000d55942>] print_bad_pte+0x146/0x18e
[303420.530000] [<ffffffe000d56df6>] unmap_page_range+0x33a/0x5ea
[303420.530000] [<ffffffe000d570d4>] unmap_single_vma+0x2e/0x40
[303420.530000] [<ffffffe000d57278>] unmap_vmas+0x42/0x7a
[303420.530000] [<ffffffe000d5cb92>] exit_mmap+0x7e/0x106
[303420.530000] [<ffffffe000c872aa>] mmput.part.2+0x26/0xa0
[303420.530000] [<ffffffe000c87344>] mmput+0x20/0x28
[303420.530000] [<ffffffe000c8ba24>] do_exit+0x238/0x7c0
[303420.530000] [<ffffffe000c8c006>] do_group_exit+0x2a/0x82
[303420.530000] [<ffffffe000c8c076>] __wake_up_parent+0x0/0x22
[303420.530000] [<ffffffe000c83722>] ret_from_syscall+0x0/0xe

Andreas.

-- 
Andreas Schwab, SUSE Labs, schwab at suse.de
GPG Key fingerprint = 0196 BAD8 1CE9 1970 F4BE  1748 E4D4 88E3 0EEA B9D7
"And now for something completely different."

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [sw-dev] About the Use of sfence.vma in Kernel
  2018-11-06  2:33   ` Palmer Dabbelt
                       ` (2 preceding siblings ...)
  2018-11-06  8:03     ` Andreas Schwab
@ 2018-11-06 10:46     ` Nick Kossifidis
  2018-11-06 10:46       ` Nick Kossifidis
  3 siblings, 1 reply; 14+ messages in thread
From: Nick Kossifidis @ 2018-11-06 10:46 UTC (permalink / raw)
  To: linux-riscv

On 2018-11-06 04:33, Palmer Dabbelt wrote:
> Sorry, I missed your original email.
> 
> On Sun, 04 Nov 2018 16:49:29 PST (-0800), alankao at andestech.com wrote:
>> Hi Palmer,
>> 
>> I believe the code in arch/riscv/mm/fault.c is mostly from you.
>> Do you have any comments on this?
>> 
>> On Thu, Nov 01, 2018 at 05:00:15PM +0800, Alan Kao wrote:
>>> Hi all,
>>> 
>>> As mentioned in the Privileged Spec about sfence.vma instruction:
>>> 
>>> > The supervisor memory-management fence instruction SFENCE.VMA is used
>>> > to synchronize updates to in-memory memory-management data structures
>>> > with current execution.  Instruction execution causes implicit reads
>>> > and writes to these data structures;  however, these implicit references
>>> > are ordinarily not ordered with respect to loads and stores in the instruction
>>> > stream.
>>> >
>>> > Executing an SFENCE.VMA instruction guarantees that any stores in the
>>> > instruction stream prior to the SFENCE.VMA are ordered before all implicit
>>> > references subsequent to the SFENCE.VMA.
>>> 
>>> It naturally follows that we should use sfence.vma once the page 
>>> table is
>>> modified.  There are several examples in the kernel already, such as
>>> 
>>> alloc_set_pte (in mm/memory.c):
>>> ...
>>>         set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);
>>>         /* no need to invalidate: a not-present page won't be cached 
>>> */
>>>         update_mmu_cache(vma, vmf->address, vmf->pte);
>>> ...
>>> where the update_mmu_cache function eventually issues a sfence.vma.
>>> 
>>> I was interested if it is always the case and did some research.  
>>> RV64 uses
>>> 3-level of page table entry, pud, pmd and pte, so I traced a little 
>>> bit about
>>> the code flow after set_pud, set_pmd and set_pte.
>>> 
>>> It turns out that some of the calls to them are not followed by a
>>> sfence.vma.  For an instance, in the vmalloc_fault region in 
>>> do_page_fault,
>>> there is no sfence.vma or calls to it after set_pgd, which directs to 
>>> set_pud
>>> later.
> 
> This specific one looks like a bug: we're trying to fill out the page
> table for the vmalloc region, but we'll just continue trapping without
> an "sfence.vma".  The path between poking the page tables and the sret
> is pretty short and doesn't appear to ever have an "sfence.vma", so
> I'm not sure how this could work.
> 
>>> 
>>> Are they bugs or I just misunderstand the instruction?  As the kernel 
>>> has
>>> already been stable for quite a while now, it is not likely to be a 
>>> critical
>>> bug.
>>> 
>>> Any clarification will highly appreciated.
> 
> Well, certainly from this it looks pretty broken -- and in a manner
> I'd expect to trigger frequently.  There are no fences in any of the
> other similar-looking implementations.
> 
> Maybe I'm missing something here?
> 

This is not something we can catch on QEMU, and I think the TileLink
interconnect used on SiFive boards can be coherent by design (TL-C), so maybe
that's why you don't get any failures.

BTW, here is how arm64 does it; I think their code is clean.  Instead of
sfence they use dsb, so maybe we can get some ideas about when we should
fence:
https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/pgtable.h

However, I have a question.  If I get this right, the weak memory model of
RISC-V allows us to do stuff on one core without having to notify the others,
as long as the code keeps running on that core (it doesn't provide guarantees
for multi-core systems).  Also, when harts share the same cache we don't need
to sfence, and we don't need to do it when caches are coherent anyway.  Since
cache coherency is implementation-specific, maybe doing this for everybody
isn't the right approach.  Sfence is expensive, it covers the whole address
space, and if we do it on every context switch it may introduce some unwanted
overhead.

Some ideas:

a) Have a function (maybe a function pointer on some platform/board-specific
struct; a relevant discussion here -> https://lkml.org/lkml/2018/10/31/431)
that implements sfence for this purpose if needed.  If it's not needed it'll
be a nop, so chips that have coherent caches won't have to sfence on every
page table update.

b) Use some feedback from the scheduler: for example, if the process's cpu
affinity says that the process is pinned on a single core, or that it's only
scheduled on cores that share their cache, we can again skip the sfence.
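A rough shape for (a) could be the following.  Every name here is hypothetical
(riscv_platform_ops and friends are made up for illustration, not an existing
kernel interface), and the #ifdef guard only lets the sketch compile on
non-RISC-V hosts:

```c
#include <stdint.h>

/* Hypothetical platform descriptor: boards whose translation
 * hardware snoops the dcache install the nop, everyone else
 * installs the real fence. */
struct riscv_platform_ops {
	void (*flush_tlb_page)(uintptr_t addr);
};

static void flush_tlb_page_nop(uintptr_t addr)
{
	(void)addr;	/* coherent translation caches: nothing to do */
}

static void flush_tlb_page_sfence(uintptr_t addr)
{
#ifdef __riscv
	__asm__ __volatile__ ("sfence.vma %0" : : "r" (addr) : "memory");
#else
	(void)addr;	/* non-RISC-V host: nothing to emit */
#endif
}

/* Safe default: always fence until the platform says otherwise. */
static struct riscv_platform_ops platform_ops = {
	.flush_tlb_page = flush_tlb_page_sfence,
};

/* Call sites go through the pointer, so the per-update cost on a
 * coherent platform is one indirect call to a nop. */
static inline void platform_flush_tlb_page(uintptr_t addr)
{
	platform_ops.flush_tlb_page(addr);
}
```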

Regards,
Nick

^ permalink raw reply	[flat|nested] 14+ messages in thread

* [sw-dev] About the Use of sfence.vma in Kernel
  2018-11-06  8:03     ` Andreas Schwab
  2018-11-06  8:03       ` Andreas Schwab
@ 2018-11-07 15:51       ` Palmer Dabbelt
  2018-11-07 15:51         ` Palmer Dabbelt
  1 sibling, 1 reply; 14+ messages in thread
From: Palmer Dabbelt @ 2018-11-07 15:51 UTC (permalink / raw)
  To: linux-riscv

On Tue, 06 Nov 2018 00:03:19 PST (-0800), schwab at suse.de wrote:
> Perhaps that's the reason I sometimes get errors like this:
>
> [303420.500000] swap_info_get: Bad swap file entry 2000000000df09b1
> [303420.500000] BUG: Bad page map in process struct-ret-3.ex  pte:6f84d8c0 pmd:8f579001
> [303420.510000] addr:000000008a1610cd vm_flags:00100173 anon_vma:0000000036036fc9 mapping:          (null) index:3ffffe7
> [303420.520000] file:          (null) fault:          (null) mmap:          (null) readpage:          (null)
> [303420.530000] CPU: 1 PID: 2054 Comm: struct-ret-3.ex Not tainted 4.19.0-00040-g2d5ee99e76 #42
> [303420.530000] Call Trace:
> [303420.530000] [<ffffffe000c847d4>] walk_stackframe+0x0/0xa4
> [303420.530000] [<ffffffe000c849d4>] show_stack+0x2a/0x34
> [303420.530000] [<ffffffe0011a6800>] dump_stack+0x62/0x7c
> [303420.530000] [<ffffffe000d55942>] print_bad_pte+0x146/0x18e
> [303420.530000] [<ffffffe000d56df6>] unmap_page_range+0x33a/0x5ea
> [303420.530000] [<ffffffe000d570d4>] unmap_single_vma+0x2e/0x40
> [303420.530000] [<ffffffe000d57278>] unmap_vmas+0x42/0x7a
> [303420.530000] [<ffffffe000d5cb92>] exit_mmap+0x7e/0x106
> [303420.530000] [<ffffffe000c872aa>] mmput.part.2+0x26/0xa0
> [303420.530000] [<ffffffe000c87344>] mmput+0x20/0x28
> [303420.530000] [<ffffffe000c8ba24>] do_exit+0x238/0x7c0
> [303420.530000] [<ffffffe000c8c006>] do_group_exit+0x2a/0x82
> [303420.530000] [<ffffffe000c8c076>] __wake_up_parent+0x0/0x22
> [303420.530000] [<ffffffe000c83722>] ret_from_syscall+0x0/0xe

From my understanding of it, this should manifest as an infinite loop:

* The implementation will raise an exception due to an unmapped page.
* The trap handler will then go fix up that exception and attempt to re-execute 
  the offending instruction.
* The implementation will raise the exception again because its translation caches 
  haven't been updated.

This requires an implementation that caches invalid mappings.  IIRC we don't do 
that in QEMU or in Rocket.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2018-11-07 15:51 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-11-01  9:00 About the Use of sfence.vma in Kernel Alan Kao
2018-11-01  9:00 ` Alan Kao
2018-11-05  0:49 ` [sw-dev] " Alan Kao
2018-11-05  0:49   ` Alan Kao
2018-11-06  2:33   ` Palmer Dabbelt
2018-11-06  2:33     ` Palmer Dabbelt
2018-11-06  7:49     ` Alan Kao
2018-11-06  7:49       ` Alan Kao
2018-11-06  8:03     ` Andreas Schwab
2018-11-06  8:03       ` Andreas Schwab
2018-11-07 15:51       ` Palmer Dabbelt
2018-11-07 15:51         ` Palmer Dabbelt
2018-11-06 10:46     ` Nick Kossifidis
2018-11-06 10:46       ` Nick Kossifidis
