[QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush

Linux-ARM-Kernel Archive on lore.kernel.org

* [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
@ 2023-04-27  3:26 Gang Li
  2023-04-27  7:30 ` Mark Rutland
  0 siblings, 1 reply; 8+ messages in thread
From: Gang Li @ 2023-04-27  3:26 UTC (permalink / raw)
  To: Will Deacon, Tomasz Nowicki, Laura Abbott
  Cc: Catalin Marinas, Will Deacon, Ard Biesheuvel, Anshuman Khandual,
	Mark Rutland, Kefeng Wang, Feiyang Chen, linux-arm-kernel,
	linux-kernel

Hi all,

I have encountered a performance issue on our ARM64 machine, which seems
to be caused by flush_tlb_kernel_range().

Here is the stack on the ARM64 machine:

# ARM64:
```
     ghes_unmap
     clear_fixmap
     __set_fixmap
     flush_tlb_kernel_range
```

As we can see, the ARM64 implementation eventually calls
flush_tlb_kernel_range, which flushes the TLB on all cores. However, on
AMD64, the implementation calls flush_tlb_one_kernel instead.

# AMD64:
```
     ghes_unmap
     clear_fixmap
     __set_fixmap
     mmu.set_fixmap
     native_set_fixmap
     __native_set_fixmap
     set_pte_vaddr
     set_pte_vaddr_p4d
     __set_pte_vaddr
     flush_tlb_one_kernel
```

On our ARM64 machine, flush_tlb_kernel_range is causing a noticeable
performance degradation.

This arm64 patch said:
https://lore.kernel.org/all/20161201135112.15396-1-fu.wei@linaro.org/
(commit 9f9a35a7b654e006250530425eb1fb527f0d32e9)

```
/*
  * Despite its name, this function must still broadcast the TLB
  * invalidation in order to ensure other CPUs don't end up with junk
  * entries as a result of speculation. Unusually, its also called in
  * IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
  * TLB broadcasting, then we're in trouble here.
  */
static inline void arch_apei_flush_tlb_one(unsigned long addr)
{
     flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
}
```

1. I am curious to know the reason behind the design choice of flushing
the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
the TLB on a single core. Are there any TLB design details that make a
difference here?

2. Is it possible to let ARM64 flush the TLB on just one core,
similar to AMD64?

3. If so, would there be any potential drawbacks or limitations to
making such a change?

Thanks,

Gang Li

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

* Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
  2023-04-27  3:26 [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush Gang Li
@ 2023-04-27  7:30 ` Mark Rutland
  2023-05-05  9:48   ` Gang Li
  0 siblings, 1 reply; 8+ messages in thread
From: Mark Rutland @ 2023-04-27  7:30 UTC (permalink / raw)
  To: Gang Li
  Cc: Will Deacon, Tomasz Nowicki, Laura Abbott, Catalin Marinas,
	Ard Biesheuvel, Anshuman Khandual, Kefeng Wang, Feiyang Chen,
	linux-arm-kernel, linux-kernel

Hi,

On Thu, Apr 27, 2023 at 11:26:50AM +0800, Gang Li wrote:
> Hi all,
> 
> I have encountered a performance issue on our ARM64 machine, which seems
> to be caused by flush_tlb_kernel_range().

Can you please provide a few more details on what you're seeing?

What does your performance issue look like?

Are you sure that the performance issue is caused by flush_tlb_kernel_range()
specifically?

> Here is the stack on the ARM64 machine:
> 
> # ARM64:
> ```
>     ghes_unmap
>     clear_fixmap
>     __set_fixmap
>     flush_tlb_kernel_range
> ```
> 
> As we can see, the ARM64 implementation eventually calls
> flush_tlb_kernel_range, which flushes the TLB on all cores. However, on
> AMD64, the implementation calls flush_tlb_one_kernel instead.
> 
> # AMD64:
> ```
>     ghes_unmap
>     clear_fixmap
>     __set_fixmap
>     mmu.set_fixmap
>     native_set_fixmap
>     __native_set_fixmap
>     set_pte_vaddr
>     set_pte_vaddr_p4d
>     __set_pte_vaddr
>     flush_tlb_one_kernel
> ```
> 
> On our ARM64 machine, flush_tlb_kernel_range is causing a noticeable
> performance degradation.

As above, could you please provide more details on this?

> This arm64 patch said:
> https://lore.kernel.org/all/20161201135112.15396-1-fu.wei@linaro.org/
> (commit 9f9a35a7b654e006250530425eb1fb527f0d32e9)
> 
> ```
> /*
>  * Despite its name, this function must still broadcast the TLB
>  * invalidation in order to ensure other CPUs don't end up with junk
>  * entries as a result of speculation. Unusually, its also called in
>  * IRQ context (ghes_iounmap_irq) so if we ever need to use IPIs for
>  * TLB broadcasting, then we're in trouble here.
>  */
> static inline void arch_apei_flush_tlb_one(unsigned long addr)
> {
>     flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> }
> ```
> 
> 1. I am curious to know the reason behind the design choice of flushing
> the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
> the TLB on a single core. Are there any TLB design details that make a
> difference here?

I don't know why arm64 only clears this on a single CPU.

On arm64 we *must* invalidate the TLB on all CPUs as the kernel page tables are
shared by all CPUs, and the architectural Break-Before-Make rules require
the TLB to be invalidated between two valid (but distinct) entries.

> 2. Is it possible to let ARM64 flush the TLB on just one core,
> similar to AMD64?

No. If we omitted the broadcast TLB invalidation, then a different CPU may
fetch the old value into a TLB, then fetch the new value. When this happens,
the architecture permits "amalgamation", with UNPREDICTABLE results, which
could result in memory corruption, taking SErrors, etc.
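
As a rough illustration of the scenario above, here is a toy userspace model
in C. It is not kernel code and not how any real TLB is implemented. All names
in it (`toy_tlb`, `toy_speculative_fill`, `toy_remap`, and so on) are made up
for this sketch. It assumes a single shared entry and lets a CPU that never
explicitly touches the mapping cache it anyway:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NR_CPUS 4

/* One fixmap-like slot in a page table shared by every CPU (toy model). */
static uint64_t shared_pte;            /* 0 means "invalid" */

/* Each CPU's private cached copy of that slot. */
struct toy_tlb {
    bool     valid;
    uint64_t pa;
};
static struct toy_tlb tlb[NR_CPUS];

/* A CPU may walk the shared table at any time, e.g. speculatively. */
static void toy_speculative_fill(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    if (shared_pte != 0 && !tlb[cpu].valid) {
        tlb[cpu].valid = true;
        tlb[cpu].pa = shared_pte;
    }
}

/* Invalidate only the calling CPU's cached copy. */
static void toy_flush_local(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    tlb[cpu].valid = false;
}

/* Invalidate every CPU's cached copy. */
static void toy_flush_broadcast(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        tlb[cpu].valid = false;
}

/* Count CPUs whose cached translation disagrees with the shared table. */
static int toy_count_stale(void)
{
    int stale = 0;

    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (tlb[cpu].valid && tlb[cpu].pa != shared_pte)
            stale++;
    return stale;
}

/* Remap the slot from old_pa to new_pa; returns the stale entries left. */
static int toy_remap(uint64_t old_pa, uint64_t new_pa, bool broadcast)
{
    toy_flush_broadcast();             /* reset the toy state */
    shared_pte = old_pa;
    toy_speculative_fill(0);           /* CPU 0 maps and uses the slot */
    toy_speculative_fill(2);           /* CPU 2 never accessed it explicitly */

    shared_pte = 0;                    /* unmap */
    if (broadcast)
        toy_flush_broadcast();
    else
        toy_flush_local(0);

    shared_pte = new_pa;               /* next user maps a different page */
    return toy_count_stale();
}
```

In this model, `toy_remap(0x1000, 0x2000, false)` leaves one stale entry
behind (on CPU 2), while `toy_remap(0x1000, 0x2000, true)` leaves none. The
model only shows staleness; it makes no attempt to model amalgamation.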

> 3. If so, would there be any potential drawbacks or limitations to
> making such a change?

As above, we must use broadcast TLB invalidation here.

Thanks,
Mark.

* Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
  2023-04-27  7:30 ` Mark Rutland
@ 2023-05-05  9:48   ` Gang Li
  2023-05-05 12:28     ` Gang Li
  2023-05-06  2:51     ` Gang Li
  0 siblings, 2 replies; 8+ messages in thread
From: Gang Li @ 2023-05-05  9:48 UTC (permalink / raw)
  To: Mark Rutland, Gang Li
  Cc: Will Deacon, Tomasz Nowicki, Laura Abbott, Catalin Marinas,
	Ard Biesheuvel, Anshuman Khandual, Kefeng Wang, Feiyang Chen,
	linux-arm-kernel, linux-kernel

Part of this thread accidentally dropped the Cc list, so I am forwarding
the lost emails to the mailing list.

On 2023/4/28 17:27, Mark Rutland wrote:
> 
> 
> Hi,
> 
> Just to check -- did you mean to drop the other Ccs? It would be good to keep
> this discussion on-list if possible.
> 
> On Fri, Apr 28, 2023 at 01:49:46PM +0800, Gang Li wrote:
>> On 2023/4/27 15:30, Mark Rutland wrote:
>>> On Thu, Apr 27, 2023 at 11:26:50AM +0800, Gang Li wrote:
>>>> 1. I am curious to know the reason behind the design choice of flushing
>>>> the TLB on all cores for ARM64's clear_fixmap, while AMD64 only flushes
>>>> the TLB on a single core. Are there any TLB design details that make a
>>>> difference here?
>>>
>>> I don't know why arm64 only clears this on a single CPU.
>>
>> Sorry, I'm a bit confused.
>>
>> Did you mean you don't know why *amd64* only clears this on a single
>> CPU?
> 
> Yes, sorry; I meant to say "amd64" rather than "arm64" here.
> 
>> Looks like I should ask amd64 guy 😉
> 
> 😉
> 
>>> On arm64 we *must* invalidate the TLB on all CPUs as the kernel page tables are
>>> shared by all CPUs, and the architectural Break-Before-Make rules require
>>> the TLB to be invalidated between two valid (but distinct) entries.
>>
>> ghes_unmap is protected by a spin_lock, so only one core can access this
>> memory area at a time. My understanding was that there would be no TLB
>> entries for this memory area on other cores.
>>
>> Is it because arm64 has speculative execution? Even if a core does not
>> hold the spin_lock, can its TLB still cache translations used in the
>> critical section?
> 
> The architecture allows a CPU to allocate TLB entries at any time for any
> reason, for any valid translation table entries reachable from the root in
> TTBR{0,1}_ELx. That can be due to speculation, prefetching, and/or other
> reasons.
> 
> Due to that, it doesn't matter whether or not a CPU explicitly accesses a
> memory location -- TLB entries can be allocated regardless. Consequently, the
> spinlock doesn't make any difference.
> 
> Thanks,
> Mark.
> 


* Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
  2023-05-05  9:48   ` Gang Li
@ 2023-05-05 12:28     ` Gang Li
  2023-05-16  3:16       ` Gang Li
  2023-05-06  2:51     ` Gang Li
  1 sibling, 1 reply; 8+ messages in thread
From: Gang Li @ 2023-05-05 12:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Will Deacon, Tomasz Nowicki, Laura Abbott, Catalin Marinas,
	Ard Biesheuvel, Anshuman Khandual, Kefeng Wang, Feiyang Chen,
	linux-arm-kernel, linux-kernel, x86

Hi,

I found that in `ghes_unmap`, which is protected by a spinlock, arm64 and
x86 have different strategies for flushing the TLB.

# arm64 call trace:
```
holding a spin lock
ghes_unmap
  clear_fixmap
   __set_fixmap
    flush_tlb_kernel_range
```

# x86 call trace:
```
holding a spin lock
ghes_unmap
  clear_fixmap
   __set_fixmap
    mmu.set_fixmap
     native_set_fixmap
      __native_set_fixmap
       set_pte_vaddr
        set_pte_vaddr_p4d
         __set_pte_vaddr
          flush_tlb_one_kernel
```

As we can see, ghes_unmap in arm64 eventually calls
flush_tlb_kernel_range to broadcast TLB invalidation. However, on
x86, ghes_unmap calls flush_tlb_one_kernel.

Why does arm64 need to broadcast TLB invalidation in ghes_unmap when only
one CPU has accessed this memory area?

Mark Rutland said in 
https://lore.kernel.org/lkml/369d1be2-d418-1bfb-bfc2-b25e4e542d76@bytedance.com/

> The architecture (arm64) allows a CPU to allocate TLB entries at any time
> for any reason, for any valid translation table entries reachable from the
> root in TTBR{0,1}_ELx. That can be due to speculation, prefetching, and/or
> other reasons.
>
> Due to that, it doesn't matter whether or not a CPU explicitly accesses a
> memory location -- TLB entries can be allocated regardless. Consequently, the
> spinlock doesn't make any difference.
>

So arm64 broadcasts TLB invalidation in ghes_unmap because a TLB entry can
be allocated regardless of whether the CPU explicitly accesses memory.

Why doesn't x86 broadcast TLB invalidation in ghes_unmap? Is there any
difference between x86 and arm64 in TLB allocation and invalidation 
strategy?

Thanks,
Gang Li

* Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
  2023-05-05 12:28     ` Gang Li
@ 2023-05-16  3:16       ` Gang Li
  0 siblings, 0 replies; 8+ messages in thread
From: Gang Li @ 2023-05-16  3:16 UTC (permalink / raw)
  To: x86, Thomas Gleixner
  Cc: Will Deacon, Tomasz Nowicki, Laura Abbott, Catalin Marinas,
	Ard Biesheuvel, Anshuman Khandual, Kefeng Wang, Feiyang Chen,
	linux-arm-kernel, linux-kernel

Hi all!

On 2023/5/5 20:28, Gang Li wrote:
> Hi,
> 
> I found that in `ghes_unmap`, which is protected by a spinlock, arm64 and
> x86 have different strategies for flushing the TLB.
> 
> # arm64 call trace:
> ```
> holding a spin lock
> ghes_unmap
>   clear_fixmap
>    __set_fixmap
>     flush_tlb_kernel_range
> ```
> 
> # x86 call trace:
> ```
> holding a spin lock
> ghes_unmap
>   clear_fixmap
>    __set_fixmap
>     mmu.set_fixmap
>      native_set_fixmap
>       __native_set_fixmap
>        set_pte_vaddr
>         set_pte_vaddr_p4d
>          __set_pte_vaddr
>           flush_tlb_one_kernel
> ```
>
> So arm64 broadcasts TLB invalidation in ghes_unmap because a TLB entry can
> be allocated regardless of whether the CPU explicitly accesses memory.
> 
> Why doesn't x86 broadcast TLB invalidation in ghes_unmap? Is there any
> difference between x86 and arm64 in TLB allocation and invalidation 
> strategy?
> 

I found this in Intel® 64 and IA-32 Architectures Software Developer
Manuals:

> 4.10.2.3 Details of TLB Use
> Subject to the limitations given in the previous paragraph, the
> processor may cache a translation for any linear address, even if that
> address is not used to access memory. For example, the processor may
> cache translations required for prefetches and for accesses that result
> from speculative execution that would never actually occur in the
> executed code path.

Both x86 and arm64 can cache translations in the TLB for prefetches and
speculative execution. Why, then, are their flush policies different?

Thanks,
Gang Li

* Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
  2023-05-05  9:48   ` Gang Li
  2023-05-05 12:28     ` Gang Li
@ 2023-05-06  2:51     ` Gang Li
       [not found]       ` <ZFpZAGeEXomG/eKS@FVFF77S0Q05N.cambridge.arm.com>
  1 sibling, 1 reply; 8+ messages in thread
From: Gang Li @ 2023-05-06  2:51 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Will Deacon, Catalin Marinas, Ard Biesheuvel, Anshuman Khandual,
	Kefeng Wang, Feiyang Chen, linux-arm-kernel, linux-kernel

Hi,

On 2023/4/28 17:27, Mark Rutland wrote:
> The architecture allows a CPU to allocate TLB entries at any time for any
> reason, for any valid translation table entries reachable from the root in
> TTBR{0,1}_ELx. That can be due to speculation, prefetching, and/or other
> reasons.
>

TLB entries can be allocated due to prefetching or branch prediction. Are
they invalidated when the prediction turns out to be wrong?

> Due to that, it doesn't matter whether or not a CPU explicitly accesses a
> memory location -- TLB entries can be allocated regardless. Consequently, the
> spinlock doesn't make any difference.
>

And is there any ARM manual or guide that explains these details, to help
us program better?

Thanks a lot for your help.
Gang Li

[parent not found: <ZFpZAGeEXomG/eKS@FVFF77S0Q05N.cambridge.arm.com>]

* Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
       [not found]       ` <ZFpZAGeEXomG/eKS@FVFF77S0Q05N.cambridge.arm.com>
@ 2023-05-16  7:47         ` Gang Li
  2023-05-16 11:51           ` Mark Rutland
  0 siblings, 1 reply; 8+ messages in thread
From: Gang Li @ 2023-05-16  7:47 UTC (permalink / raw)
  To: Mark Rutland
  Cc: Will Deacon, Catalin Marinas, Ard Biesheuvel, Anshuman Khandual,
	Kefeng Wang, Feiyang Chen, linux-arm-kernel, linux-kernel

Hi,

On 2023/5/9 22:30, Mark Rutland wrote:
> For example, early in D8.13 we have the rule:
> 
> | R_SQBCS
> |
> |   When address translation is enabled, a translation table entry for an
> |   in-context translation regime that does not cause a Translation fault, an
> |   Address size fault, or an Access flag fault is permitted to be cached in a
> |   TLB or intermediate TLB caching structure as the result of an explicit or
> |   speculative access.
> 

Thanks a lot!

I looked up the x86 manual and found that the x86 TLB caching mechanism is
similar to arm64 (but the x86 guys haven't replied to me yet):

Intel® 64 and IA-32 Architectures Software Developer Manuals:
> 4.10.2.3 Details of TLB Use
> Subject to the limitations given in the previous paragraph, the
> processor may cache a translation for any linear address, even if that
> address is not used to access memory. For example, the processor may
> cache translations required for prefetches and for accesses that result
> from speculative execution that would never actually occur in the
> executed code path.

Both architectures have similar TLB caching policies, so why does arm64
flush all CPUs while x86 flushes only the local CPU in ghes_map and
ghes_unmap?

I think flush all may be unnecessary.

1. Before accessing GHES data, each CPU needs to call ghes_map, which
will create the mapping and flush its own TLB to make sure the current
CPU is using the latest mapping.

2. And there is no need to flush all in ghes_unmap, because the ghes_map
of other CPUs will flush their own TLBs before accessing the memory.

What do you think?

Thanks,
Gang Li.

* Re: [QUESTION FOR ARM64 TLB] performance issue and implementation difference of TLB flush
  2023-05-16  7:47         ` Gang Li
@ 2023-05-16 11:51           ` Mark Rutland
  0 siblings, 0 replies; 8+ messages in thread
From: Mark Rutland @ 2023-05-16 11:51 UTC (permalink / raw)
  To: Gang Li
  Cc: Will Deacon, Catalin Marinas, Ard Biesheuvel, Anshuman Khandual,
	Kefeng Wang, Feiyang Chen, linux-arm-kernel, linux-kernel

On Tue, May 16, 2023 at 03:47:16PM +0800, Gang Li wrote:
> Hi,
> 
> On 2023/5/9 22:30, Mark Rutland wrote:
> > For example, early in D8.13 we have the rule:
> > 
> > | R_SQBCS
> > |
> > |   When address translation is enabled, a translation table entry for an
> > |   in-context translation regime that does not cause a Translation fault, an
> > |   Address size fault, or an Access flag fault is permitted to be cached in a
> > |   TLB or intermediate TLB caching structure as the result of an explicit or
> > |   speculative access.
> > 
> 
> Thanks a lot!
> 
> I looked up the x86 manual and found that the x86 TLB caching mechanism is
> similar to arm64 (but the x86 guys haven't replied to me yet):
> 
> Intel® 64 and IA-32 Architectures Software Developer Manuals:
> > 4.10.2.3 Details of TLB Use
> > Subject to the limitations given in the previous paragraph, the
> > processor may cache a translation for any linear address, even if that
> > address is not used to access memory. For example, the processor may
> > cache translations required for prefetches and for accesses that result
> > from speculative execution that would never actually occur in the
> > executed code path.
> 
> Both architectures have similar TLB caching policies, so why does arm64
> flush all CPUs while x86 flushes only the local CPU in ghes_map and
> ghes_unmap?
> 
> I think flush all may be unnecessary.
> 
> 1. Before accessing GHES data, each CPU needs to call ghes_map, which
> will create the mapping and flush its own TLB to make sure the current
> CPU is using the latest mapping.
> 
> 2. And there is no need to flush all in ghes_unmap, because the ghes_map
> of other CPUs will flush their own TLBs before accessing the memory.

This is not sufficient. Regardless of whether CPUs *explicitly* access the VA
range, any CPU which can reach the live translation table entry is allowed to
fetch that and allocate it into a TLB at any time.

When a Break-Before-Make sequence isn't followed, the architecture permits a
number of resulting behaviours, including "amalgamation", where the TLB entries
are combined in some arbitrary IMPLEMENTATION DEFINED way. The architecture
isn't very clear here, but doesn't rule out two entries being combined such
that it generates an arbitrary physical address and/or such that the MMU thinks
the entry is from an intermediate walk. In either of those cases, the CPU might
speculatively access device memory (which could change the state of the system,
or cause fatal SErrors), and/or allocate further junk into TLBs.

So per the architecture, broadcast maintenance is necessary on arm64.  The only
way to avoid it would be to have a local set of translation tables which are
not shared with other CPUs.
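
To make the ordering concrete, here is a toy userspace model in C of a
Break-Before-Make update. It is only a sketch under simplifying assumptions:
every name in it (`bbm_pte`, `bbm_walk_all`, `bbm_update_safe`, and so on) is
hypothetical, each CPU caches at most one entry, and every CPU is assumed to
walk the shared table after each write (the worst case for speculation). A
"conflict" is counted whenever a CPU could see the new valid entry while still
holding a different old one, which is the situation where the architecture
permits amalgamation:

```c
#include <assert.h>
#include <stdint.h>

#define BBM_NR_CPUS 4
#define BBM_INVALID 0

static uint64_t bbm_pte;                  /* shared, live table entry */
static uint64_t bbm_tlb[BBM_NR_CPUS];     /* per-CPU cached copy, 0 = none */
static int bbm_conflicts;                 /* old and new visible to one CPU */

/* Any CPU may cache the live entry at any time. */
static void bbm_walk_all(void)
{
    for (int cpu = 0; cpu < BBM_NR_CPUS; cpu++) {
        if (bbm_pte == BBM_INVALID)
            continue;                     /* invalid entries are not cached */
        if (bbm_tlb[cpu] != BBM_INVALID && bbm_tlb[cpu] != bbm_pte)
            bbm_conflicts++;              /* two distinct valid entries */
        else
            bbm_tlb[cpu] = bbm_pte;
    }
}

static void bbm_write_pte(uint64_t val)
{
    bbm_pte = val;
    bbm_walk_all();                       /* model worst-case speculation */
}

static void bbm_tlbi_broadcast(void)
{
    for (int cpu = 0; cpu < BBM_NR_CPUS; cpu++)
        bbm_tlb[cpu] = BBM_INVALID;
}

static void bbm_reset(uint64_t initial)
{
    bbm_conflicts = 0;
    bbm_tlbi_broadcast();
    bbm_write_pte(initial);
}

/* Break, invalidate on every CPU, then make. */
static int bbm_update_safe(uint64_t old_pa, uint64_t new_pa)
{
    assert(old_pa != BBM_INVALID && new_pa != BBM_INVALID);
    bbm_reset(old_pa);
    bbm_write_pte(BBM_INVALID);           /* break */
    bbm_tlbi_broadcast();                 /* broadcast invalidation */
    bbm_write_pte(new_pa);                /* make */
    return bbm_conflicts;
}

/* Break and make, but only the updating CPU (CPU 0) invalidates. */
static int bbm_update_local_only(uint64_t old_pa, uint64_t new_pa)
{
    assert(old_pa != BBM_INVALID && new_pa != BBM_INVALID);
    bbm_reset(old_pa);
    bbm_write_pte(BBM_INVALID);           /* break */
    bbm_tlb[0] = BBM_INVALID;             /* local invalidation only */
    bbm_write_pte(new_pa);                /* make */
    return bbm_conflicts;
}
```

In this model, `bbm_update_safe(0x1000, 0x2000)` reports no conflicts, while
`bbm_update_local_only(0x1000, 0x2000)` reports one conflict for each CPU other
than the updater. What real hardware does in the conflicting case is not
modelled here.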

I suspect x86 might not have the same issue with amalgamation.

Thanks,
Mark.
