linux-arm-kernel.lists.infradead.org archive mirror
* tlbi  va, vaa vs. val, vaal
@ 2015-02-27  0:12 Mario Smarduch
  2015-02-27 10:24 ` Will Deacon
  0 siblings, 1 reply; 8+ messages in thread
From: Mario Smarduch @ 2015-02-27  0:12 UTC (permalink / raw)
  To: linux-arm-kernel

I noticed the kernel's tlbflush.h uses the tlbi va*, vaa* variants
instead of the val*, vaal* ones. Reading the manual (section D.5.7.2),
it appears the va*, vaa* versions also invalidate intermediate
caching of translation structures.

With stage 2 enabled that may result in 20+ memory lookups for a
4-level page table walk. That assumes the intermediate caching
structures cache mappings from stage-1 table entries to host pages.

- Mario

^ permalink raw reply	[flat|nested] 8+ messages in thread

* tlbi  va, vaa vs. val, vaal
  2015-02-27  0:12 tlbi va, vaa vs. val, vaal Mario Smarduch
@ 2015-02-27 10:24 ` Will Deacon
  2015-02-27 10:29   ` Marc Zyngier
  2015-02-27 21:15   ` Mario Smarduch
  0 siblings, 2 replies; 8+ messages in thread
From: Will Deacon @ 2015-02-27 10:24 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Feb 27, 2015 at 12:12:32AM +0000, Mario Smarduch wrote:
> I noticed kernel tlbflush.h use tlbi va*, vaa* variants instead of
> val, vaal ones. Reading the manual D.5.7.2 it appears that
> va*, vaa* versions invalidate intermediate caching of
> translation structures.
> 
> With stage2 enabled that may result in 20+ memory lookups
> for a 4 level page table walk. That's assuming that intermediate
> caching structures cache mappings from stage1 table entry to
> host page.

Yeah, Catalin and I discussed improving the kernel support for this,
but it requires some changes to the generic mmu_gather code so that we
can distinguish the leaf cases. I'd also like to see that done in a way
that takes into account different granule sizes (we currently iterate
over huge pages in 4k chunks). Last time I touched that, I entered a
world of pain and don't plan to return there immediately :)

Catalin -- feeling brave?

FWIW: the new IOMMU page-table stuff I just got merged *does* make use
of leaf-invalidation for the SMMU.

Will


* tlbi  va, vaa vs. val, vaal
  2015-02-27 10:24 ` Will Deacon
@ 2015-02-27 10:29   ` Marc Zyngier
  2015-02-27 10:33     ` Will Deacon
  2015-02-27 21:15   ` Mario Smarduch
  1 sibling, 1 reply; 8+ messages in thread
From: Marc Zyngier @ 2015-02-27 10:29 UTC (permalink / raw)
  To: linux-arm-kernel

On 27/02/15 10:24, Will Deacon wrote:
> On Fri, Feb 27, 2015 at 12:12:32AM +0000, Mario Smarduch wrote:
>> I noticed kernel tlbflush.h use tlbi va*, vaa* variants instead of
>> val, vaal ones. Reading the manual D.5.7.2 it appears that
>> va*, vaa* versions invalidate intermediate caching of
>> translation structures.
>>
>> With stage2 enabled that may result in 20+ memory lookups
>> for a 4 level page table walk. That's assuming that intermediate
>> caching structures cache mappings from stage1 table entry to
>> host page.
> 
> Yeah, Catalin and I discussed improving the kernel support for this,
> but it requires some changes to the generic mmu_gather code so that we
> can distinguish the leaf cases. I'd also like to see that done in a way
> that takes into account different granule sizes (we currently iterate
> over huge pages in 4k chunks). Last time I touched that, I entered a
> world of pain and don't plan to return there immediately :)
> 
> Catalin -- feeling brave?
> 
> FWIW: the new IOMMU page-table stuff I just got merged *does* make use
> of leaf-invalidation for the SMMU.

Now, talking about feeling brave: who will be silly enough to port KVM
to the IOMMU page table code? It should just work(tm), right?

	M.
-- 
Jazz is not dead. It just smells funny...


* tlbi  va, vaa vs. val, vaal
  2015-02-27 10:29   ` Marc Zyngier
@ 2015-02-27 10:33     ` Will Deacon
  2015-02-27 10:44       ` Marc Zyngier
  0 siblings, 1 reply; 8+ messages in thread
From: Will Deacon @ 2015-02-27 10:33 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Feb 27, 2015 at 10:29:06AM +0000, Marc Zyngier wrote:
> On 27/02/15 10:24, Will Deacon wrote:
> > On Fri, Feb 27, 2015 at 12:12:32AM +0000, Mario Smarduch wrote:
> >> I noticed kernel tlbflush.h use tlbi va*, vaa* variants instead of
> >> val, vaal ones. Reading the manual D.5.7.2 it appears that
> >> va*, vaa* versions invalidate intermediate caching of
> >> translation structures.
> >>
> >> With stage2 enabled that may result in 20+ memory lookups
> >> for a 4 level page table walk. That's assuming that intermediate
> >> caching structures cache mappings from stage1 table entry to
> >> host page.
> > 
> > Yeah, Catalin and I discussed improving the kernel support for this,
> > but it requires some changes to the generic mmu_gather code so that we
> > can distinguish the leaf cases. I'd also like to see that done in a way
> > that takes into account different granule sizes (we currently iterate
> > over huge pages in 4k chunks). Last time I touched that, I entered a
> > world of pain and don't plan to return there immediately :)
> > 
> > Catalin -- feeling brave?
> > 
> > FWIW: the new IOMMU page-table stuff I just got merged *does* make use
> > of leaf-invalidation for the SMMU.
> 
> Now, talking about feeling brave: who will be silly enough to port KVM
> to the IOMMU page table code? It should just work(tm), right?

I suspect you'll need to do some surgery to the interfaces, which currently
map directly onto the IOMMU API and therefore make nice assumptions about
what we get asked to map/unmap. You also probably want a wider range of
permissions than we use on the SMMU. Finally, the runtime nature of the
code (we make no assumptions about address sizes, page sizes etc) probably
incurs a performance hit that you may or may not care about.

Will


* tlbi  va, vaa vs. val, vaal
  2015-02-27 10:33     ` Will Deacon
@ 2015-02-27 10:44       ` Marc Zyngier
  0 siblings, 0 replies; 8+ messages in thread
From: Marc Zyngier @ 2015-02-27 10:44 UTC (permalink / raw)
  To: linux-arm-kernel

On 27/02/15 10:33, Will Deacon wrote:
> On Fri, Feb 27, 2015 at 10:29:06AM +0000, Marc Zyngier wrote:
>> On 27/02/15 10:24, Will Deacon wrote:
>>> On Fri, Feb 27, 2015 at 12:12:32AM +0000, Mario Smarduch wrote:
>>>> I noticed kernel tlbflush.h use tlbi va*, vaa* variants instead of
>>>> val, vaal ones. Reading the manual D.5.7.2 it appears that
>>>> va*, vaa* versions invalidate intermediate caching of
>>>> translation structures.
>>>>
>>>> With stage2 enabled that may result in 20+ memory lookups
>>>> for a 4 level page table walk. That's assuming that intermediate
>>>> caching structures cache mappings from stage1 table entry to
>>>> host page.
>>>
>>> Yeah, Catalin and I discussed improving the kernel support for this,
>>> but it requires some changes to the generic mmu_gather code so that we
>>> can distinguish the leaf cases. I'd also like to see that done in a way
>>> that takes into account different granule sizes (we currently iterate
>>> over huge pages in 4k chunks). Last time I touched that, I entered a
>>> world of pain and don't plan to return there immediately :)
>>>
>>> Catalin -- feeling brave?
>>>
>>> FWIW: the new IOMMU page-table stuff I just got merged *does* make use
>>> of leaf-invalidation for the SMMU.
>>
>> Now, talking about feeling brave: who will be silly enough to port KVM
>> to the IOMMU page table code? It should just work(tm), right?
> 
> I suspect you'll need to do some surgery to the interfaces, which currently
> map directly onto the IOMMU API and therefore make nice assumptions about
> what we get asked to map/unmap. You also probably want a wider range of
> permissions than we use on the SMMU. Finally, the runtime nature of the
> code (we make no assumptions about address sizes, page sizes etc) probably
> incurs a performance hit that you may or may not care about.

That's exactly what I want to evaluate. It would also help us to
decouple our page-table code from the kernel macros, which bite us time
and time again...

Thanks,

	M.
-- 
Jazz is not dead. It just smells funny...


* tlbi  va, vaa vs. val, vaal
  2015-02-27 10:24 ` Will Deacon
  2015-02-27 10:29   ` Marc Zyngier
@ 2015-02-27 21:15   ` Mario Smarduch
  2015-03-02 16:23     ` Catalin Marinas
  1 sibling, 1 reply; 8+ messages in thread
From: Mario Smarduch @ 2015-02-27 21:15 UTC (permalink / raw)
  To: linux-arm-kernel

On 02/27/2015 02:24 AM, Will Deacon wrote:
> On Fri, Feb 27, 2015 at 12:12:32AM +0000, Mario Smarduch wrote:
>> I noticed kernel tlbflush.h use tlbi va*, vaa* variants instead of
>> val, vaal ones. Reading the manual D.5.7.2 it appears that
>> va*, vaa* versions invalidate intermediate caching of
>> translation structures.
>>
>> With stage2 enabled that may result in 20+ memory lookups
>> for a 4 level page table walk. That's assuming that intermediate
>> caching structures cache mappings from stage1 table entry to
>> host page.
> 
> Yeah, Catalin and I discussed improving the kernel support for this,
> but it requires some changes to the generic mmu_gather code so that we
> can distinguish the leaf cases. I'd also like to see that done in a way
> that takes into account different granule sizes (we currently iterate
> over huge pages in 4k chunks). Last time I touched that, I entered a
> world of pain and don't plan to return there immediately :)
> 
> Catalin -- feeling brave?
> 
> FWIW: the new IOMMU page-table stuff I just got merged *does* make use
> of leaf-invalidation for the SMMU.
> 
> Will
> 
Hi Will,
  thanks for the background. I'm guessing how much of the page-table
walk is cached is implementation dependent. One old paper quotes up to
40% improvement on some industry benchmarks when all stage-1/stage-2
walk entries are cached. Something to benchmark, I guess.

- Mario


* tlbi  va, vaa vs. val, vaal
  2015-02-27 21:15   ` Mario Smarduch
@ 2015-03-02 16:23     ` Catalin Marinas
  2015-03-02 19:26       ` Mario Smarduch
  0 siblings, 1 reply; 8+ messages in thread
From: Catalin Marinas @ 2015-03-02 16:23 UTC (permalink / raw)
  To: linux-arm-kernel

On Fri, Feb 27, 2015 at 01:15:57PM -0800, Mario Smarduch wrote:
> On 02/27/2015 02:24 AM, Will Deacon wrote:
> > On Fri, Feb 27, 2015 at 12:12:32AM +0000, Mario Smarduch wrote:
> >> I noticed kernel tlbflush.h use tlbi va*, vaa* variants instead of
> >> val, vaal ones. Reading the manual D.5.7.2 it appears that
> >> va*, vaa* versions invalidate intermediate caching of
> >> translation structures.
> >>
> >> With stage2 enabled that may result in 20+ memory lookups
> >> for a 4 level page table walk. That's assuming that intermediate
> >> caching structures cache mappings from stage1 table entry to
> >> host page.
> > 
> > Yeah, Catalin and I discussed improving the kernel support for this,
> > but it requires some changes to the generic mmu_gather code so that we
> > can distinguish the leaf cases. I'd also like to see that done in a way
> > that takes into account different granule sizes (we currently iterate
> > over huge pages in 4k chunks). Last time I touched that, I entered a
> > world of pain and don't plan to return there immediately :)
> > 
> > Catalin -- feeling brave?
> > 
> > FWIW: the new IOMMU page-table stuff I just got merged *does* make use
> > of leaf-invalidation for the SMMU.
> 
>   thanks for the background. I'm guessing how much of PTWalk
> is cached is implementation dependent. One old paper quotes upto 40%
> improvement for some industry benchmarks that cache all stage1/2 PTWalk
> entries.

Is it caching in the TLB or in the level 1 CPU cache?

I would indeed expect some improvement without many drawbacks. The only
thing we need in Linux is to distinguish between leaf TLBI and the TLBI
used for page-table teardown. It's not complicated, it just needs some
testing (strangely enough, I replaced all user TLBI with the L variants
on a Juno board and saw no signs of any crashes).

-- 
Catalin


* tlbi  va, vaa vs. val, vaal
  2015-03-02 16:23     ` Catalin Marinas
@ 2015-03-02 19:26       ` Mario Smarduch
  0 siblings, 0 replies; 8+ messages in thread
From: Mario Smarduch @ 2015-03-02 19:26 UTC (permalink / raw)
  To: linux-arm-kernel

On 03/02/2015 08:23 AM, Catalin Marinas wrote:
> On Fri, Feb 27, 2015 at 01:15:57PM -0800, Mario Smarduch wrote:
>> On 02/27/2015 02:24 AM, Will Deacon wrote:
>>> On Fri, Feb 27, 2015 at 12:12:32AM +0000, Mario Smarduch wrote:
>>>> I noticed kernel tlbflush.h use tlbi va*, vaa* variants instead of
>>>> val, vaal ones. Reading the manual D.5.7.2 it appears that
>>>> va*, vaa* versions invalidate intermediate caching of
>>>> translation structures.
>>>>
>>>> With stage2 enabled that may result in 20+ memory lookups
>>>> for a 4 level page table walk. That's assuming that intermediate
>>>> caching structures cache mappings from stage1 table entry to
>>>> host page.
>>>
>>> Yeah, Catalin and I discussed improving the kernel support for this,
>>> but it requires some changes to the generic mmu_gather code so that we
>>> can distinguish the leaf cases. I'd also like to see that done in a way
>>> that takes into account different granule sizes (we currently iterate
>>> over huge pages in 4k chunks). Last time I touched that, I entered a
>>> world of pain and don't plan to return there immediately :)
>>>
>>> Catalin -- feeling brave?
>>>
>>> FWIW: the new IOMMU page-table stuff I just got merged *does* make use
>>> of leaf-invalidation for the SMMU.
>>
>>   thanks for the background. I'm guessing how much of PTWalk
>> is cached is implementation dependent. One old paper quotes upto 40%
>> improvement for some industry benchmarks that cache all stage1/2 PTWalk
>> entries.
> 
> Is it caching in the TLB or in the level 1 CPU cache?

AFAICT this is caching in what other vendors call a page walk cache.
For the host the improvement may not be that dramatic, but for a guest
every stage-1 table/pte lookup is itself an n-level stage-2 walk. I
would expect performance to vary with the CPU's implementation of this
intermediate cache, especially whether nested page entries are cached;
one CPU may show a huge improvement while others may not.

> 
> I would indeed expect some improvement without many drawbacks. The only
> thing we need in Linux is to distinguish between leaf TLBI and TLBI for
> page table tearing down. It's not complicated, it just needs some
> testing (strangely enough, I tried to replace all user TLBI with the L
> variants on a Juno board and no signs of any crashes).

I tried that too and it worked, but only with a very minimal test. I
think I understand the concern, though: using the 'L' variants may
leave intermediate table entries cached and corrupt another process's
page-table walk.

- Mario



end of thread, other threads:[~2015-03-02 19:26 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-02-27  0:12 tlbi va, vaa vs. val, vaal Mario Smarduch
2015-02-27 10:24 ` Will Deacon
2015-02-27 10:29   ` Marc Zyngier
2015-02-27 10:33     ` Will Deacon
2015-02-27 10:44       ` Marc Zyngier
2015-02-27 21:15   ` Mario Smarduch
2015-03-02 16:23     ` Catalin Marinas
2015-03-02 19:26       ` Mario Smarduch
