public inbox for linux-arm-kernel@lists.infradead.org
* [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
@ 2023-04-27  3:26 Gang Li
  2023-04-27  7:30 ` Mark Rutland
From: Gang Li @ 2023-04-27  3:26 UTC (permalink / raw)
  To: Will Deacon, Tomasz Nowicki, Laura Abbott
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Anshuman Khandual,
	Mark Rutland, Kefeng Wang, Feiyang Chen, linux-arm-kernel,
	linux-kernel

Hi all,

I have encountered a performance issue on our ARM64 machine that seems
to be caused by flush_tlb_kernel_range.

Here is the stack on the ARM64 machine:

# ARM64:
```
     ghes_unmap
     clear_fixmap
     __set_fixmap
     flush_tlb_kernel_range
```

As we can see, the ARM64 implementation eventually calls
flush_tlb_kernel_range, which flushes the TLB on all cores. On AMD64,
however, the implementation calls flush_tlb_one_kernel, which flushes
the TLB only on a single core.

# AMD64:
```
     ghes_unmap
     clear_fixmap
     __set_fixmap
     mmu.set_fixmap
     native_set_fixmap
     __native_set_fixmap
     set_pte_vaddr
     set_pte_vaddr_p4d
     __set_pte_vaddr
     flush_tlb_one_kernel
```

On our ARM64 machine, flush_tlb_kernel_range is causing a noticeable
performance degradation.

This arm64 patch says:
https://lore.kernel.org/all/20161201135112.15396-1-fu.wei@linaro.org/
(commit 9f9a35a7b654e006250530425eb1fb527f0d32e9)

```
/*
  * Despite its name, this function must still broadcast the TLB
  * invalidation in order to ensure other CPUs don't end up with junk
  * entries as a result of speculation. Unusually, its also called in
  * IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
  * TLB broadcasting, then we're in trouble here.
  */
static inline void arch_apei_flush_tlb_one(unsigned long addr)
{
     flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}
```
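
The difference between the two flush styles can be sketched with a toy
userspace model. This is not kernel code: every name below (toy_tlb,
toy_flush_local, toy_flush_broadcast, and so on) is hypothetical, and the
model assumes 4 cores, 4 KiB pages, and one cached translation per core.
It only illustrates why a local-only flush can leave entries behind on
other cores if they have already cached the mapping.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS    4   /* hypothetical core count for the model */
#define TOY_PAGE_SHIFT 12  /* assumes 4 KiB pages */

/* Toy model: each CPU caches at most one kernel translation. */
struct toy_tlb_entry {
	bool valid;
	unsigned long vpn;	/* cached virtual page number */
};

static struct toy_tlb_entry toy_tlb[TOY_NR_CPUS];

/* A CPU caches the translation, e.g. after a (speculative) table walk. */
void toy_tlb_fill(int cpu, unsigned long addr)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	toy_tlb[cpu].valid = true;
	toy_tlb[cpu].vpn = addr >> TOY_PAGE_SHIFT;
}

/* Local flush: drop the entry on one CPU only. Returns 1 if one was dropped. */
int toy_flush_local(int cpu, unsigned long addr)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	if (toy_tlb[cpu].valid &&
	    toy_tlb[cpu].vpn == (addr >> TOY_PAGE_SHIFT)) {
		toy_tlb[cpu].valid = false;
		return 1;
	}
	return 0;
}

/* Broadcast flush: drop the entry on every CPU. Returns how many were dropped. */
int toy_flush_broadcast(unsigned long addr)
{
	int dropped = 0;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		dropped += toy_flush_local(cpu, addr);
	return dropped;
}

/* Count the CPUs that still hold a translation for addr. */
int toy_stale_entries(unsigned long addr)
{
	int n = 0;

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (toy_tlb[cpu].valid &&
		    toy_tlb[cpu].vpn == (addr >> TOY_PAGE_SHIFT))
			n++;
	return n;
}
```

In this model, if every core has cached the page and the mapping is then
torn down, toy_flush_local() on one core leaves TOY_NR_CPUS - 1 entries
behind, while toy_flush_broadcast() leaves none. The model says nothing
about the cost of either operation on real hardware.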

1. I am curious to know the reason behind the design choice of flushing
the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
the TLB on a single core. Are there any TLB design details that make a
difference here?

2. Is it possible to let ARM64 flush the TLB on just one core, as AMD64
does?

3. If so, would there be any potential drawbacks or limitations to
making such a change?

Thanks,

Gang Li


Thread overview: 8+ messages (newest: 2023-05-16 11:52 UTC)
2023-04-27  3:26 [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush Gang Li
2023-04-27  7:30 ` Mark Rutland
2023-05-05  9:48   ` Gang Li
2023-05-05 12:28     ` Gang Li
2023-05-16  3:16       ` Gang Li
2023-05-06  2:51     ` Gang Li
     [not found]       ` <ZFpZAGeEXomG/eKS@FVFF77S0Q05N.cambridge.arm.com>
2023-05-16  7:47         ` Gang Li
2023-05-16 11:51           ` Mark Rutland
