public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Dylan Jhong <dylan@andestech.com>
To: Alexandre Ghiti <alexghiti@rivosinc.com>
Cc: "Conor Dooley" <conor.dooley@microchip.com>,
	"Guo Ren" <guoren@kernel.org>,
	"Paul Walmsley" <paul.walmsley@sifive.com>,
	"Palmer Dabbelt" <palmer@dabbelt.com>,
	"Albert Ou" <aou@eecs.berkeley.edu>,
	"Björn Töpel" <bjorn@rivosinc.com>,
	linux-riscv@lists.infradead.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH -fixes] riscv: Implement flush_cache_vmap()
Date: Fri, 4 Aug 2023 15:48:05 +0800	[thread overview]
Message-ID: <ZMytNY2J8iyjbPPy@atctrx.andestech.com> (raw)
In-Reply-To: <CAHVXubiz-7LaxCJLW=-ekr7TBFswXojr1ODU4mo59Z1OBmjieg@mail.gmail.com>

On Thu, Aug 03, 2023 at 11:48:36AM +0200, Alexandre Ghiti wrote:
> On Thu, Aug 3, 2023 at 11:25 AM Conor Dooley <conor.dooley@microchip.com> wrote:
> >
> > On Thu, Aug 03, 2023 at 05:14:15PM +0800, dylan wrote:
> > > On Sun, Jul 30, 2023 at 01:08:17AM -0400, Guo Ren wrote:
> > > > On Tue, Jul 25, 2023 at 9:22 AM Alexandre Ghiti <alexghiti@rivosinc.com> wrote:
> > > > >
> > > > > The RISC-V kernel needs a sfence.vma after a page table modification: we
> > > > > used to rely on the vmalloc fault handling to emit an sfence.vma, but
> > > > > commit 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for
> > > > > vmalloc/modules area") got rid of this path for 64-bit kernels, so now we
> > > > > need to explicitly emit a sfence.vma in flush_cache_vmap().
> > > > >
> > > > > Note that we don't need to implement flush_cache_vunmap() as the generic
> > > > > code should emit a flush tlb after unmapping a vmalloc region.
> > > > >
> > > > > Fixes: 7d3332be011e ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area")
> > > > > Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
> > > > > ---
> > > > >  arch/riscv/include/asm/cacheflush.h | 4 ++++
> > > > >  1 file changed, 4 insertions(+)
> > > > >
> > > > > diff --git a/arch/riscv/include/asm/cacheflush.h b/arch/riscv/include/asm/cacheflush.h
> > > > > index 8091b8bf4883..b93ffddf8a61 100644
> > > > > --- a/arch/riscv/include/asm/cacheflush.h
> > > > > +++ b/arch/riscv/include/asm/cacheflush.h
> > > > > @@ -37,6 +37,10 @@ static inline void flush_dcache_page(struct page *page)
> > > > >  #define flush_icache_user_page(vma, pg, addr, len) \
> > > > >         flush_icache_mm(vma->vm_mm, 0)
> > > > >
> > > > > +#ifdef CONFIG_64BIT
> > > > > +#define flush_cache_vmap(start, end)   flush_tlb_kernel_range(start, end)
> > > > Sorry, I couldn't agree with the above in a PIPT cache machine. It's
> > > > not worth for.
> > > >
> > > > It would reduce the performance of vmap_pages_range,
> > > > ioremap_page_range ... API, which may cause some drivers' performance
> > > > issues when they install/uninstall memory frequently.
> > > >
> > >
> > > Hi All,
> > >
> > > I think functional correctness should be more important than system performance
> > > in this case. The "preventive" SFENCE.VMA became necessary due to the RISC-V
> > > specification allowing invalidation entries to be cached in the TLB.
> >
> > We are at -rc4 and this stuff is broken. Taking the bigger hammer, which
> > can be reverted later when a more targeted fix shows up, to make sure
> > that v6.5 doesn't end up broken, sounds rather prudent. Otherwise, the
> > original commit should probably be reverted.
> 
> The original commit that removed vmalloc_fault() is required, handling
> vmalloc faults in the page fault path is not possible (see the links
> in the description of 7d3332be011e and the example that I gave in the
> thread https://lore.kernel.org/linux-riscv/dc26625b-6658-c078-76d2-7e975a04b1d4@ghiti.fr/).
> 
> I totally agree with Dylan that we'll work (I'm currently working on
> that) on the performance side of the problem in the next release, we
> need correctness and for that we need a preventive global sfence.vma
> as we have no means (for now) to distinguish between uarch that cache
> or not invalid entries.
> 
> >
> > > The problem[1] we are currently encountering is caused by not updating the TLB
> > > after the page table is created, and the solution to this problem can only be
> > > solved by updating the TLB immediately after the page table is created.
> > >
> > > There are currently two possible approaches to flush TLB:
> > > 1. Flush TLB in flush_cache_vmap()
> > > 2. Flush TLB in arch_sync_kernel_mappings()
> > >
> > > But I'm not quite sure if it's a good idea to operate on the TLB inside flush_cache_vmap().
> > > The name of this function indicates that it should be related to cache operations, maybe
> > > it would be more appropriate to do TLB flush in arch_sync_kernel_mappings()?
> 
> TLDR: The downsides to implementing arch_sync_kernel_mappings()
> instead of flush_cache_vmap():
> 
> - 2 global flushes for vunmap instead of 1 for flush_cache_vmap()
> - flushes the tlb in the noflush suffixed functions so it prevents any
> flush optimization (ie: a loop of vmap_range_noflush() without flush
> and then a final flush afterwards)
> 
> So I'd favour the flush_cache_vmap() implementation which seems
> lighter. powerpc does that
> https://elixir.bootlin.com/linux/latest/source/arch/powerpc/include/asm/cacheflush.h#L27
> (but admits that it may not be the right place)
> 
> Here is the long story (my raw notes):
> 
> * arch_sync_kernel_mappings() is called from:
> - _apply_to_page_range(): would only emit global sfence.vma if vmalloc
> addresses, I guess that's ok.
> - __vunmap_range_noflush(): it is noted here
> https://elixir.bootlin.com/linux/latest/source/mm/vmalloc.c#L406 that
> any caller must call flush_tlb_kernel_range(). Then the implementation
> of arch_sync_kernel_mappings() would result in 2 global tlb flushes.
> - vmap_range_noflush(): does not fit well with the noflush() suffix.
> 
> * flush_cache_vmap() is called from:
> - kasan_populate_vmalloc(): legit since it bypasses vmap api (but
> called right a apply_to_page_range() so your patch would work here)
> - kmsan_vunmap_range_noflush(): called twice for the mappings kmsan
> establishes and flush_tlb_kernel_range() must be called afterwards =>
> 3 global tlb flushes but the 3 are needed as they target different
> addresses. Implementing only arch_sync_kernel_mappings() would result
> in way more global flushes (see the loop here
> https://elixir.bootlin.com/linux/latest/source/mm/kmsan/hooks.c#L151
> where  __vmap_pages_range_noflush() would result in more
> flush_tlb_all())
> - kmsan_vmap_pages_range_noflush(): here we would flush twice, but
> same thing for the arch_sync_kernel_mappings() implementation.
> - ioremap_page_range(): legit, same as arch_sync_kernel_mappings()
> implementation.
> - vmap_pages_range(): legit, same as arch_sync_kernel_mappings() implementation.
> 
> Let me know what you think!
> 
> Alex
> 
Hi Alex,

Thank you for the detailed explanation. It is indeed undeniable that in certain
situations, there might be a possibility of repeated flushing TLB. But I think
there are some potential problem in flush_cache_vmap().

In most case, vmap_range_noflush() and flush_cache_vmap() will appear at the same
time, so it should be no problem to choose one of them to do the TLB flush. But
flush_cache_vmap() does not cover all the places where apply_to_page_range()
appears (please correct me if I'm wrong), such as vmap_pfn()[1].

The function you mentioned here, each will eventually call:
    vmap_range_noflush() -> arch_sync_kernel_mappings() -> TLB Flush

As for the performance, because the current parameter of flush_tlb_page() needs to
pass *vma, we cannot pass in this parameter so we can only choose flush_tlb_all().
If it can be changed to flush_tlb_page() in the future, the performance should be improved.

[1]: https://elixir.bootlin.com/linux/v6.5-rc4/source/mm/vmalloc.c#L2977

Best regards,
Dylan Jhong

> > >
> > > [1]: http://lists.infradead.org/pipermail/linux-riscv/2023-August/037503.html
> >

  parent reply	other threads:[~2023-08-04  7:48 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-25 13:22 [PATCH -fixes] riscv: Implement flush_cache_vmap() Alexandre Ghiti
2023-07-30  5:08 ` Guo Ren
2023-08-03  9:14   ` dylan
2023-08-03  9:24     ` Conor Dooley
2023-08-03  9:48       ` Alexandre Ghiti
2023-08-03  9:58         ` Alexandre Ghiti
2023-08-03 10:05           ` Conor Dooley
2023-08-04  7:48         ` Dylan Jhong [this message]
2023-08-08 11:23           ` Alexandre Ghiti
2023-08-08 23:55             ` Guo Ren
2023-08-10 16:00               ` Palmer Dabbelt
2023-08-10 16:22                 ` Dylan Jhong
2023-08-10 19:39                   ` Palmer Dabbelt
2023-08-08 23:47     ` Guo Ren
2023-08-10 19:50 ` patchwork-bot+linux-riscv

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZMytNY2J8iyjbPPy@atctrx.andestech.com \
    --to=dylan@andestech.com \
    --cc=alexghiti@rivosinc.com \
    --cc=aou@eecs.berkeley.edu \
    --cc=bjorn@rivosinc.com \
    --cc=conor.dooley@microchip.com \
    --cc=guoren@kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=palmer@dabbelt.com \
    --cc=paul.walmsley@sifive.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox