* [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
@ 2023-04-27 3:26 Gang Li
2023-04-27 7:30 ` Mark Rutland
0 siblings, 1 reply; 8+ messages in thread
From: Gang Li @ 2023-04-27 3:26 UTC (permalink / raw)
To: Will Deacon, Tomasz Nowicki, Laura Abbott
Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Anshuman Khandual,
Mark Rutland, Kefeng Wang, Feiyang Chen, linux-arm-kernel,
linux-kernel
Hi all,
I have encountered a performance issue on our ARM64 machine, which seems
to be caused by flush_tlb_kernel_range().
Here is the stack on the ARM64 machine:
# ARM64:
```
ghes_unmap
clear_fixmap
__set_fixmap
flush_tlb_kernel_range
```
As we can see, the ARM64 implementation eventually calls
flush_tlb_kernel_range, which flushes the TLB on all cores. However, on
AMD64, the implementation calls flush_tlb_one_kernel instead.
# AMD64:
```
ghes_unmap
clear_fixmap
__set_fixmap
mmu.set_fixmap
native_set_fixmap
__native_set_fixmap
set_pte_vaddr
set_pte_vaddr_p4d
__set_pte_vaddr
flush_tlb_one_kernel
```
On our ARM64 machine, flush_tlb_kernel_range is causing a noticeable
performance degradation.
This arm64 patch said:
https://lore.kernel.org/all/20161201135112.15396-1-fu.wei@linaro.org/
(commit 9f9a35a7b654e006250530425eb1fb527f0d32e9)
```
/*
* Despite its name, this function must still broadcast the TLB
* invalidation in order to ensure other CPUs don't end up with junk
* entries as a result of speculation. Unusually, its also called in
* IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
* TLB broadcasting, then we're in trouble here.
*/
static inline void arch_apei_flush_tlb_one(unsigned long addr)
{
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}
```
1. I am curious to know the reason behind the design choice of flushing
the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
the TLB on a single core. Are there any TLB design details that make a
difference here?
2. Is it possible to let ARM64 flush the TLB on just one core, as AMD64
does?
3. If so, would there be any potential drawbacks or limitations to
making such a change?
Thanks,
Gang Li
_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
^ permalink raw reply [flat|nested] 8+ messages in thread

* Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
2023-04-27 3:26 [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush Gang Li
@ 2023-04-27 7:30 ` Mark Rutland
2023-05-05 9:48 ` Gang Li
0 siblings, 1 reply; 8+ messages in thread
From: Mark Rutland @ 2023-04-27 7:30 UTC (permalink / raw)
To: Gang Li
Cc: Will Deacon, Tomasz Nowicki, Laura Abbott, Catalin Marinas,
Ard Biesheuvel, Anshuman Khandual, Kefeng Wang, Feiyang Chen,
linux-arm-kernel, linux-kernel
Hi,
On Thu, Apr 27, 2023 at 11:26:50AM +0800, Gang Li wrote:
> Hi all,
>
> I have encountered a performance issue on our ARM64 machine, which seems
> to be caused by flush_tlb_kernel_range().
Can you please provide a few more details on what you're seeing?
What does your performance issue look like?
Are you sure that the performance issue is caused by flush_tlb_kernel_range()
specifically?
> Here is the stack on the ARM64 machine:
>
> # ARM64:
> ```
> ghes_unmap
> clear_fixmap
> __set_fixmap
> flush_tlb_kernel_range
> ```
>
> As we can see, the ARM64 implementation eventually calls
> flush_tlb_kernel_range, which flushes the TLB on all cores. However, on
> AMD64, the implementation calls flush_tlb_one_kernel instead.
>
> # AMD64:
> ```
> ghes_unmap
> clear_fixmap
> __set_fixmap
> mmu.set_fixmap
> native_set_fixmap
> __native_set_fixmap
> set_pte_vaddr
> set_pte_vaddr_p4d
> __set_pte_vaddr
> flush_tlb_one_kernel
> ```
>
> On our ARM64 machine, flush_tlb_kernel_range is causing a noticeable
> performance degradation.
As above, could you please provide more details on this?
> This arm64 patch said:
> https://lore.kernel.org/all/20161201135112.15396-1-fu.wei@linaro.org/
> (commit 9f9a35a7b654e006250530425eb1fb527f0d32e9)
>
> ```
> /*
> * Despite its name, this function must still broadcast the TLB
> * invalidation in order to ensure other CPUs don't end up with junk
> * entries as a result of speculation. Unusually, its also called in
> * IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
> * TLB broadcasting, then we're in trouble here.
> */
> static inline void arch_apei_flush_tlb_one(unsigned long addr)
> {
> flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> }
> ```
>
> 1. I am curious to know the reason behind the design choice of flushing
> the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
> the TLB on a single core. Are there any TLB design details that make a
> difference here?
I don't know why arm64 only clears this on a single CPU.
On arm64 we *must* invalidate the TLB on all CPUs as the kernel page tables are
shared by all CPUs, and the architectural Break-Before-Make rules require
the TLB to be invalidated between two valid (but distinct) entries.
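The required sequence looks roughly like this; a hedged sketch of what the
architecture demands when a valid kernel mapping changes, not the literal
kernel code (the real implementation lives in __set_fixmap() and the tlbflush
helpers):

```
pte_clear(ptep);                 /* 1. Break: write an invalid entry      */
dsb(ishst);                      /* 2. Ensure the write is observable     */
flush_tlb_kernel_range(va, va + PAGE_SIZE);
                                 /* 3. Broadcast TLBI so no CPU keeps a
                                  *    stale entry for the old mapping    */
set_pte(ptep, new_pte);          /* 4. Make: install the new valid entry  */
```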
> 2. Is it possible to let ARM64 flush the TLB on just one core, as AMD64
> does?
No. If we omitted the broadcast TLB invalidation, then a different CPU may
fetch the old value into a TLB, then fetch the new value. When this happens,
the architecture permits "amalgamation", with UNPREDICTABLE results, which
could result in memory corruption, taking SErrors, etc.
> 3. If so, would there be any potential drawbacks or limitations to
> making such a change?
As above, we must use broadcast TLB invalidation here.
Thanks,
Mark.
* Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
2023-04-27 7:30 ` Mark Rutland
@ 2023-05-05 9:48 ` Gang Li
2023-05-05 12:28 ` Gang Li
2023-05-06 2:51 ` Gang Li
0 siblings, 2 replies; 8+ messages in thread
From: Gang Li @ 2023-05-05 9:48 UTC (permalink / raw)
To: Mark Rutland, Gang Li
Cc: Will Deacon, Tomasz Nowicki, Laura Abbott, Catalin Marinas,
Ard Biesheuvel, Anshuman Khandual, Kefeng Wang, Feiyang Chen,
linux-arm-kernel, linux-kernel
This thread accidentally lost its Cc list. I am now forwarding the lost
emails to the mailing list.
On 2023/4/28 17:27, Mark Rutland wrote:
>
>
> Hi,
>
> Just to check -- did you mean to drop the other Ccs? It would be good to keep
> this discussion on-list if possible.
>
> On Fri, Apr 28, 2023 at 01:49:46PM +0800, Gang Li wrote:
>> On 2023/4/27 15:30, Mark Rutland wrote:
>>> On Thu, Apr 27, 2023 at 11:26:50AM +0800, Gang Li wrote:
>>>> 1. I am curious to know the reason behind the design choice of flushing
>>>> the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
>>>> the TLB on a single core. Are there any TLB design details that make a
>>>> difference here?
>>>
>>> I don't know why arm64 only clears this on a single CPU.
>>
>> Sorry, I'm a bit confused.
>>
>> Did you mean you don't know why *amd64* only clears this on a single
>> CPU?
>
> Yes, sorry; I meant to say "amd64" rather than "arm64" here.
>
>> Looks like I should ask the amd64 folks 😉
>
> 😉
>
>>> On arm64 we *must* invalidate the TLB on all CPUs as the kernel page tables are
>>> shared by all CPUs, and the architectural Break-Before-Make rules require
>>> the TLB to be invalidated between two valid (but distinct) entries.
>>
>> ghes_unmap is protected by a spin_lock, so only one core can access this
>> mem area at a time. I understand that there will be no TLB for
>> this memory area on other cores.
>>
>> Is it because arm64 has speculative execution? Even if the core does not
>> hold the spin_lock, the TLB will still cache the critical section?
>
> The architecture allows a CPU to allocate TLB entries at any time for any
> reason, for any valid translation table entries reachable from the root in
> TTBR{0,1}_ELx. That can be due to speculation, prefetching, and/or other
> reasons.
>
> Due to that, it doesn't matter whether or not a CPU explicitly accesses a
> memory location -- TLB entries can be allocated regardless. Consequently, the
> spinlock doesn't make any difference.
>
> Thanks,
> Mark.
>
* Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
2023-05-05 9:48 ` Gang Li
@ 2023-05-05 12:28 ` Gang Li
2023-05-16 3:16 ` Gang Li
2023-05-06 2:51 ` Gang Li
1 sibling, 1 reply; 8+ messages in thread
From: Gang Li @ 2023-05-05 12:28 UTC (permalink / raw)
To: Thomas Gleixner
Cc: Will Deacon, Tomasz Nowicki, Laura Abbott, Catalin Marinas,
Ard Biesheuvel, Anshuman Khandual, Kefeng Wang, Feiyang Chen,
linux-arm-kernel, linux-kernel, x86
Hi,
I found that in `ghes_unmap`, which is protected by a spinlock, arm64 and
x86 have different strategies for flushing the TLB.
# arm64 call trace:
```
holding a spin lock
ghes_unmap
clear_fixmap
__set_fixmap
flush_tlb_kernel_range
```
# x86 call trace:
```
holding a spin lock
ghes_unmap
clear_fixmap
__set_fixmap
mmu.set_fixmap
native_set_fixmap
__native_set_fixmap
set_pte_vaddr
set_pte_vaddr_p4d
__set_pte_vaddr
flush_tlb_one_kernel
```
As we can see, ghes_unmap in arm64 eventually calls
flush_tlb_kernel_range to broadcast TLB invalidation. However, on
x86, ghes_unmap calls flush_tlb_one_kernel.
Why does arm64 need to broadcast TLB invalidation in ghes_unmap, when only
one CPU has accessed this memory area?
Mark Rutland said in
https://lore.kernel.org/lkml/369d1be2-d418-1bfb-bfc2-b25e4e542d76@bytedance.com/
> The architecture (arm64) allows a CPU to allocate TLB entries at any time for any
> reason, for any valid translation table entries reachable from the
> root in
> TTBR{0,1}_ELx. That can be due to speculation, prefetching, and/or other
> reasons.
>
> Due to that, it doesn't matter whether or not a CPU explicitly accesses a
> memory location -- TLB entries can be allocated regardless.
> Consequently, the
> spinlock doesn't make any difference.
>
arm64 broadcasts TLB invalidation in ghes_unmap because a TLB entry can be
allocated regardless of whether the CPU explicitly accesses the memory.
Why doesn't x86 broadcast TLB invalidation in ghes_unmap? Is there any
difference between x86 and arm64 in TLB allocation and invalidation
strategy?
Thanks,
Gang Li
* Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
2023-05-05 12:28 ` Gang Li
@ 2023-05-16 3:16 ` Gang Li
0 siblings, 0 replies; 8+ messages in thread
From: Gang Li @ 2023-05-16 3:16 UTC (permalink / raw)
To: x86, Thomas Gleixner
Cc: Will Deacon, Tomasz Nowicki, Laura Abbott, Catalin Marinas,
Ard Biesheuvel, Anshuman Khandual, Kefeng Wang, Feiyang Chen,
linux-arm-kernel, linux-kernel
Hi all!
On 2023/5/5 20:28, Gang Li wrote:
> Hi,
>
> I found that in `ghes_unmap` protected by spinlock, arm64 and x86 have
> different strategies for flushing tlb.
>
> # arm64 call trace:
> ```
> holding a spin lock
> ghes_unmap
> clear_fixmap
> __set_fixmap
> flush_tlb_kernel_range
> ```
>
> # x86 call trace:
> ```
> holding a spin lock
> ghes_unmap
> clear_fixmap
> __set_fixmap
> mmu.set_fixmap
> native_set_fixmap
> __native_set_fixmap
> set_pte_vaddr
> set_pte_vaddr_p4d
> __set_pte_vaddr
> flush_tlb_one_kernel
> ```
>
> arm64 broadcasts TLB invalidation in ghes_unmap because a TLB entry can be
> allocated regardless of whether the CPU explicitly accesses the memory.
>
> Why doesn't x86 broadcast TLB invalidation in ghes_unmap? Is there any
> difference between x86 and arm64 in TLB allocation and invalidation
> strategy?
>
I found this in the Intel® 64 and IA-32 Architectures Software Developer's
Manual:
> 4.10.2.3 Details of TLB Use
> Subject to the limitations given in the previous paragraph, the
> processor may cache a translation for any linear address, even if that
> address is not used to access memory. For example, the processor may
> cache translations required for prefetches and for accesses that result
> from speculative execution that would never actually occur in the
> executed code path.
Both x86 and arm64 can cache TLB entries for prefetches and speculative
execution. Why, then, are their flush policies different?
Thanks,
Gang Li
* Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
2023-05-05 9:48 ` Gang Li
2023-05-05 12:28 ` Gang Li
@ 2023-05-06 2:51 ` Gang Li
[not found] ` <ZFpZAGeEXomG/eKS@FVFF77S0Q05N.cambridge.arm.com>
1 sibling, 1 reply; 8+ messages in thread
From: Gang Li @ 2023-05-06 2:51 UTC (permalink / raw)
To: Mark Rutland
Cc: Will Deacon, Catalin Marinas, Ard Biesheuvel, Anshuman Khandual,
Kefeng Wang, Feiyang Chen, linux-arm-kernel, linux-kernel
Hi,
On 2023/4/28 17:27, Mark Rutland wrote:
> The architecture allows a CPU to allocate TLB entries at any time for any
> reason, for any valid translation table entries reachable from the
> root in
> TTBR{0,1}_ELx. That can be due to speculation, prefetching, and/or other
> reasons.
>
A TLB entry may be allocated due to prefetching or branch prediction. Will
it be invalidated when the prediction fails?
> Due to that, it doesn't matter whether or not a CPU explicitly accesses a
> memory location -- TLB entries can be allocated regardless.
> Consequently, the
> spinlock doesn't make any difference.
>
Is there any ARM manual or guide that explains these details, to help us
program with them in mind?
Thanks a lot for your help.
Gang Li
2023-04-27 3:26 [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush Gang Li
2023-04-27 7:30 ` Mark Rutland
2023-05-05 9:48 ` Gang Li
2023-05-05 12:28 ` Gang Li
2023-05-16 3:16 ` Gang Li
2023-05-06 2:51 ` Gang Li
[not found] ` <ZFpZAGeEXomG/eKS@FVFF77S0Q05N.cambridge.arm.com>
2023-05-16 7:47 ` Gang Li
2023-05-16 11:51 ` Mark Rutland