PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)

* PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
@ 2015-04-28 22:15 ` Kirill A. Shutemov
  0 siblings, 0 replies; 24+ messages in thread
From: Kirill A. Shutemov @ 2015-04-28 22:15 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel,
	linux-kernel, linux-mm, x86

On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote:
> At some point, I'd like to implement PCID on x86 (if no one beats me
> to it, and this is a low priority for me), which will allow us to skip
> expensive TLB flushes while context switching.  I have no idea whether
> ARM can do something similar.

I talked with Dave about implementing PCID and he thinks that it will be
a net loss. TLB entries will live longer, which means we would need to trigger
more IPIs to flush them out when we have to. The cost of the IPIs will be higher
than the benefit from a hot TLB after a context switch.

Do you have different expectations?

-- 
 Kirill A. Shutemov

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
  2015-04-28 22:15 ` Kirill A. Shutemov
@ 2015-04-28 22:38   ` Dave Hansen
  -1 siblings, 0 replies; 24+ messages in thread
From: Dave Hansen @ 2015-04-28 22:38 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andy Lutomirski
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, Rik van Riel,
	linux-kernel, linux-mm, x86

On 04/28/2015 03:15 PM, Kirill A. Shutemov wrote:
> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote:
>> At some point, I'd like to implement PCID on x86 (if no one beats me
>> to it, and this is a low priority for me), which will allow us to skip
>> expensive TLB flushes while context switching.  I have no idea whether
>> ARM can do something similar.
> 
> I talked with Dave about implementing PCID and he thinks that it will be
> a net loss. TLB entries will live longer, which means we would need to trigger
> more IPIs to flush them out when we have to. The cost of the IPIs will be higher
> than the benefit from a hot TLB after a context switch.
> 
> Do you have different expectations?

Kirill, I think Andy is asking about something different than what you
and I talked about.  My point to you was that PCIDs cannot be used to
replace, or in lieu of, TLB shootdowns because they *only* make TLB
entries live longer.

Their entire purpose is to make things live longer and to reduce the
cost of the implicit TLB shootdowns that we do as a part of a context
switch.

I'm not sure if it will have a benefit overall.  It depends on the
increase in shootdown cost vs. the decrease in TLB refill cost at
context switch.

I think someone hacked up some code to do it (maybe just internally to
Intel), so if anyone is seriously interested in implementing it, let me
know and I'll see if I can dig it up.

* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
  2015-04-28 22:15 ` Kirill A. Shutemov
@ 2015-04-28 22:41   ` Rik van Riel
  -1 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2015-04-28 22:41 UTC (permalink / raw)
  To: Kirill A. Shutemov, Andy Lutomirski, Dave Hansen
  Cc: Linus Torvalds, Andrew Morton, Mel Gorman, linux-kernel, linux-mm,
	x86

On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote:
> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote:
>> At some point, I'd like to implement PCID on x86 (if no one beats me
>> to it, and this is a low priority for me), which will allow us to skip
>> expensive TLB flushes while context switching.  I have no idea whether
>> ARM can do something similar.
> 
> I talked with Dave about implementing PCID and he thinks that it will be
> a net loss. TLB entries will live longer, which means we would need to trigger
> more IPIs to flush them out when we have to. The cost of the IPIs will be higher
> than the benefit from a hot TLB after a context switch.

I suspect that may depend on how you do the shootdown.

If, when receiving a TLB shootdown for a non-current PCID, we just flush
all the entries for that PCID and remove the CPU from the mm's
cpu_vm_mask_var, we will never receive more than one shootdown IPI for
a non-current mm, but we will still get the benefits of TLB longevity
when dealing with e.g. pipe workloads where tasks take turns running on
the same CPU.
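
As a rough illustration of the shootdown handling described above, here is a small userspace model (not code from this thread, and not kernel code). The names `mm_sim`, `cpu_sim`, `shootdown_ipi` and `flush_tlb_mm_sim` are hypothetical, and a plain bitmask stands in for the mm's cpu_vm_mask_var. The "flush" steps are only described in comments; on real hardware the non-current case would presumably be something like a single-context INVPCID, where available.

```c
#include <stdint.h>

#define NR_CPUS 4

/* Hypothetical stand-in for an mm; cpu_mask plays the role of cpu_vm_mask_var. */
struct mm_sim {
	uint32_t cpu_mask;      /* CPUs that may still cache translations for this mm */
	unsigned int ipis_sent; /* shootdown IPIs sent on behalf of this mm */
};

struct cpu_sim {
	const struct mm_sim *current_mm; /* mm currently running on this CPU */
};

static struct cpu_sim cpus[NR_CPUS];

/* Models the handler that runs on @cpu when a shootdown IPI for @mm arrives. */
static void shootdown_ipi(int cpu, struct mm_sim *mm)
{
	if (cpus[cpu].current_mm == mm)
		return; /* current mm: flush the requested range, stay in the mask */

	/*
	 * Non-current mm: drop every entry tagged with its PCID and leave
	 * the mask, so no further IPIs for this mm reach this CPU.
	 */
	mm->cpu_mask &= ~(1u << cpu);
}

/* Sender side: IPI every CPU still in the mask, except the sender itself. */
static void flush_tlb_mm_sim(struct mm_sim *mm, int self)
{
	for (int cpu = 0; cpu < NR_CPUS; cpu++) {
		if (cpu == self || !(mm->cpu_mask & (1u << cpu)))
			continue;
		mm->ipis_sent++;
		shootdown_ipi(cpu, mm);
	}
}
```

In this model a CPU where the mm is not current receives at most one IPI per mm; when the mm is scheduled there again it would have to set its bit in the mask again and start with no cached entries for that PCID.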

-- 
All rights reversed

* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
  2015-04-28 22:41   ` Rik van Riel
@ 2015-04-28 22:54     ` Andy Lutomirski
  -1 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2015-04-28 22:54 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Kirill A. Shutemov, Dave Hansen, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	X86 ML

On Tue, Apr 28, 2015 at 3:41 PM, Rik van Riel <riel@redhat.com> wrote:
> On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote:
>> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote:
>>> At some point, I'd like to implement PCID on x86 (if no one beats me
>>> to it, and this is a low priority for me), which will allow us to skip
>>> expensive TLB flushes while context switching.  I have no idea whether
>>> ARM can do something similar.
>>
>> I talked with Dave about implementing PCID and he thinks that it will be
>> a net loss. TLB entries will live longer, which means we would need to trigger
>> more IPIs to flush them out when we have to. The cost of the IPIs will be higher
>> than the benefit from a hot TLB after a context switch.
>
> I suspect that may depend on how you do the shootdown.
>
> If, when receiving a TLB shootdown for a non-current PCID, we just flush
> all the entries for that PCID and remove the CPU from the mm's
> cpu_vm_mask_var, we will never receive more than one shootdown IPI for
> a non-current mm, but we will still get the benefits of TLB longevity
> when dealing with e.g. pipe workloads where tasks take turns running on
> the same CPU.

I had a totally different implementation idea in mind.  It goes
something like this:

For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7.  We have
a per-cpu array of the mm [1] that owns each PCID.  On context switch,
we look up the new mm in the array and, if there's a PCID mapped, we
switch cr3 and select that PCID.  If there is no PCID mapped, we
choose one (LRU?  clock replacement?), switch cr3 and select and
invalidate that PCID.

When it's time to invalidate a TLB entry on an mm that's active
remotely, we really don't want to send an IPI to a CPU that doesn't
actually have that mm active.  Instead we bump some kind of generation
counter in the mm_struct that will cause the next switch to that mm
not to match the PCID list.  To keep this working, I think we also
need to update the per-cpu PCID list with our generation counter
either when we context switch out or when we process a TLB shootdown
IPI.

This could be a bit tricky to get right, but I think it can be done
without adding more than a cacheline or two to the context switch
overhead and without any extra IPIs at all.

[1] It shouldn't be just an mm_struct pointer, because then we have to
invalidate it somehow when we recycle an mm_struct.  Maybe we'd use
some kind of counter.   We also need a TLB shootdown generation
counter of some sort as described.
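
The scheme above could be modelled roughly as follows. This is a userspace sketch under stated assumptions, not an implementation from this thread: `mm_model`, `pcid_slot`, `cpu_pcids`, `choose_pcid` and `invalidate_mm` are hypothetical names, replacement is plain round-robin (the text leaves LRU vs. clock open), and `ctx_id` is assumed to be a never-reused nonzero identifier in the spirit of footnote [1].

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_PCIDS 8 /* small fixed pool per CPU, e.g. PCIDs 0-7 */

struct mm_model {
	uint64_t ctx_id;  /* nonzero, never reused, unlike an mm_struct pointer */
	uint64_t tlb_gen; /* bumped instead of sending an IPI to inactive CPUs */
};

struct pcid_slot {
	uint64_t ctx_id;  /* owner of this PCID on this CPU; 0 = free */
	uint64_t tlb_gen; /* owner's generation when the slot was last in sync */
};

struct cpu_pcids {
	struct pcid_slot slots[NR_PCIDS];
	unsigned int next_victim; /* round-robin replacement cursor */
};

/*
 * Pick the PCID to load for @mm on @cpu. *need_flush is set when the
 * entries tagged with that PCID must be invalidated before use.
 */
static unsigned int choose_pcid(struct cpu_pcids *cpu,
				const struct mm_model *mm, bool *need_flush)
{
	for (unsigned int i = 0; i < NR_PCIDS; i++) {
		struct pcid_slot *s = &cpu->slots[i];

		if (s->ctx_id != mm->ctx_id)
			continue;
		/* Hit: stale only if the mm was invalidated while we were away. */
		*need_flush = s->tlb_gen != mm->tlb_gen;
		s->tlb_gen = mm->tlb_gen;
		return i;
	}

	/* Miss: recycle a slot and flush whatever the old owner left behind. */
	unsigned int victim = cpu->next_victim;

	cpu->next_victim = (victim + 1) % NR_PCIDS;
	cpu->slots[victim].ctx_id = mm->ctx_id;
	cpu->slots[victim].tlb_gen = mm->tlb_gen;
	*need_flush = true;
	return victim;
}

/* Invalidation for an mm that is not running on other CPUs: no IPI needed. */
static void invalidate_mm(struct mm_model *mm)
{
	mm->tlb_gen++;
}
```

The context-switch cost in this model is one scan of a small per-CPU array plus one read of the mm's generation counter, which is in line with the "cacheline or two" estimate above. CPUs that are actively running the mm would still need an IPI; the sketch does not model that path.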

--Andy

* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
  2015-04-28 22:54     ` Andy Lutomirski
@ 2015-04-28 22:56       ` Rik van Riel
  -1 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2015-04-28 22:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Kirill A. Shutemov, Dave Hansen, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	X86 ML

On 04/28/2015 06:54 PM, Andy Lutomirski wrote:
> On Tue, Apr 28, 2015 at 3:41 PM, Rik van Riel <riel@redhat.com> wrote:
>> On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote:
>>> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote:
>>>> At some point, I'd like to implement PCID on x86 (if no one beats me
>>>> to it, and this is a low priority for me), which will allow us to skip
>>>> expensive TLB flushes while context switching.  I have no idea whether
>>>> ARM can do something similar.
>>>
>>> I talked with Dave about implementing PCID and he thinks that it will be
>>> a net loss. TLB entries will live longer, which means we would need to trigger
>>> more IPIs to flush them out when we have to. The cost of the IPIs will be higher
>>> than the benefit from a hot TLB after a context switch.
>>
>> I suspect that may depend on how you do the shootdown.
>>
>> If, when receiving a TLB shootdown for a non-current PCID, we just flush
>> all the entries for that PCID and remove the CPU from the mm's
>> cpu_vm_mask_var, we will never receive more than one shootdown IPI for
>> a non-current mm, but we will still get the benefits of TLB longevity
>> when dealing with e.g. pipe workloads where tasks take turns running on
>> the same CPU.
> 
> I had a totally different implementation idea in mind.  It goes
> something like this:
> 
> For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7.  We have
> a per-cpu array of the mm [1] that owns each PCID.  On context switch,
> we look up the new mm in the array and, if there's a PCID mapped, we
> switch cr3 and select that PCID.  If there is no PCID mapped, we
> choose one (LRU?  clock replacement?), switch cr3 and select and
> invalidate that PCID.
> 
> When it's time to invalidate a TLB entry on an mm that's active
> remotely, we really don't want to send an IPI to a CPU that doesn't
> actually have that mm active.  Instead we bump some kind of generation
> counter in the mm_struct that will cause the next switch to that mm
> not to match the PCID list.  To keep this working, I think we also
> need to update the per-cpu PCID list with our generation counter
> either when we context switch out or when we process a TLB shootdown
> IPI.

If we do that, we can also get rid of TLB shootdowns for
idle CPUs in lazy TLB mode.

Very nice, if the details work out.

-- 
All rights reversed

* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
  2015-04-28 22:56       ` Rik van Riel
@ 2015-04-28 23:01         ` Andy Lutomirski
  -1 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2015-04-28 23:01 UTC (permalink / raw)
  To: Rik van Riel
  Cc: Kirill A. Shutemov, Dave Hansen, Linus Torvalds, Andrew Morton,
	Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	X86 ML

On Tue, Apr 28, 2015 at 3:56 PM, Rik van Riel <riel@redhat.com> wrote:
> On 04/28/2015 06:54 PM, Andy Lutomirski wrote:
>> On Tue, Apr 28, 2015 at 3:41 PM, Rik van Riel <riel@redhat.com> wrote:
>>> On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote:
>>>> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote:
>>>>> At some point, I'd like to implement PCID on x86 (if no one beats me
>>>>> to it, and this is a low priority for me), which will allow us to skip
>>>>> expensive TLB flushes while context switching.  I have no idea whether
>>>>> ARM can do something similar.
>>>>
>>>> I talked with Dave about implementing PCID and he thinks that it will be
>>>> a net loss. TLB entries will live longer, which means we would need to trigger
>>>> more IPIs to flush them out when we have to. The cost of the IPIs will be higher
>>>> than the benefit from a hot TLB after a context switch.
>>>
>>> I suspect that may depend on how you do the shootdown.
>>>
>>> If, when receiving a TLB shootdown for a non-current PCID, we just flush
>>> all the entries for that PCID and remove the CPU from the mm's
>>> cpu_vm_mask_var, we will never receive more than one shootdown IPI for
>>> a non-current mm, but we will still get the benefits of TLB longevity
>>> when dealing with e.g. pipe workloads where tasks take turns running on
>>> the same CPU.
>>
>> I had a totally different implementation idea in mind.  It goes
>> something like this:
>>
>> For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7.  We have
>> a per-cpu array of the mm [1] that owns each PCID.  On context switch,
>> we look up the new mm in the array and, if there's a PCID mapped, we
>> switch cr3 and select that PCID.  If there is no PCID mapped, we
>> choose one (LRU?  clock replacement?), switch cr3 and select and
>> invalidate that PCID.
>>
>> When it's time to invalidate a TLB entry on an mm that's active
>> remotely, we really don't want to send an IPI to a CPU that doesn't
>> actually have that mm active.  Instead we bump some kind of generation
>> counter in the mm_struct that will cause the next switch to that mm
>> not to match the PCID list.  To keep this working, I think we also
>> need to update the per-cpu PCID list with our generation counter
>> either when we context switch out or when we process a TLB shootdown
>> IPI.
>
> If we do that, we can also get rid of TLB shootdowns for
> idle CPUs in lazy TLB mode.
>
> Very nice, if the details work out.
>

I wonder if we could treat the non-PCID case just like the PCID case
but with only one PCID.  Maybe get rid of the mm vs active_mm
distinction.  Maybe not, though -- if nothing else, we still need to
kick our pgd out from idle or kthread CPUs before we free it.

The reason I thought of PCIDs this way is that 12 bits isn't nearly
enough to get away with allocating each mm its own PCID.  Rather than
trying to shoehorn them in, it seemed like a better approach would be
to only use a very small number, since keeping around TLB entries that
are more than a few context switches old seems mostly useless.
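
For reference, a sketch of where those 12 bits live (an illustration, not code from this thread). As far as I understand the architecture, with CR4.PCIDE set the low 12 bits of CR3 carry the PCID, and bit 63 of the value written to CR3 asks the CPU not to invalidate entries tagged with that PCID. The helper `make_cr3` and the macro names are hypothetical; the code only builds the value and never touches a real control register.

```c
#include <stdint.h>

#define PCID_BITS   12
#define PCID_MASK   ((UINT64_C(1) << PCID_BITS) - 1) /* 4096 possible PCIDs */
#define CR3_NOFLUSH (UINT64_C(1) << 63)

/*
 * Build the value that would be written to CR3 for a page-table root at
 * physical address @pgd_phys (4 KiB aligned) tagged with @pcid.
 * @keep_tlb selects the "do not flush this PCID" form of the write.
 */
static uint64_t make_cr3(uint64_t pgd_phys, unsigned int pcid, int keep_tlb)
{
	uint64_t cr3 = (pgd_phys & ~PCID_MASK) | (pcid & PCID_MASK);

	if (keep_tlb)
		cr3 |= CR3_NOFLUSH;
	return cr3;
}
```

A small per-CPU pool such as 0-7 uses only three of the twelve bits, which matches the argument above that entries more than a few context switches old are rarely worth keeping.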

--Andy

* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
  2015-04-28 23:01         ` Andy Lutomirski
@ 2015-04-28 23:19           ` Linus Torvalds
  -1 siblings, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2015-04-28 23:19 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rik van Riel, Kirill A. Shutemov, Dave Hansen, Andrew Morton,
	Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	X86 ML

On Tue, Apr 28, 2015 at 4:01 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> The reason I thought of PCIDs this way is that 12 bits isn't nearly
> enough to get away with allocating each mm its own PCID.

Not even close. And really, we've already done this for other
architectures. On alpha, the number of bits in the pcid is
model-specific, but it was something like 6 for the ones I used.
That's plenty.

Also, I don't think Intel actually does 12 bits of pcid. What they do
is to hash the 12 bits down to something smaller (like two or three
bits in the actual TLB data structure), and then the CPU basically
invalidates any pcid's that alias (have a small 4- or 8-entry array
saying that "this hash was used for this 12-bit pcid").

So there's actually *another* level of dynamic mapping going on below
the software interface.

                         Linus

* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
  2015-04-28 22:54     ` Andy Lutomirski
@ 2015-04-28 23:16       ` Linus Torvalds
  -1 siblings, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2015-04-28 23:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rik van Riel, Kirill A. Shutemov, Dave Hansen, Andrew Morton,
	Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	X86 ML

On Tue, Apr 28, 2015 at 3:54 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> I had a totally different implementation idea in mind.  It goes
> something like this:
>
> For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7.  We have
> a per-cpu array of the mm [1] that owns each PCID. [...]

We've done this before on other architectures.  See for example alpha.
Look up "__get_new_mm_context()" and friends. I think sparc does the
same (and I think sparc copied a lot of it from the alpha
implementation).

IIRC, the alpha version just generates a (per-cpu) asid one at a time,
and has a generation counter so that when you run out of ASIDs you do
a global TLB invalidate on that CPU and start from 0 again. Actually,
I think the generation number is just the high bits of the asid
counter (alpha calls them "asn", Intel calls them "pcid", and I tend
to prefer "asid", but it's all the same thing).

Then each thread just has a per-thread ASID. We don't try to make that
be per-thread and per-cpu, but instead just force a new allocation
when a thread moves to another CPU.
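
A minimal sketch of the generation-counter scheme described above, for
a single CPU. This is not the alpha code: the 6-bit width, the struct
names and flush_whole_tlb() are hypothetical stand-ins, and a real
implementation would keep one such context per CPU (or force a new
allocation on migration, as described).

```c
#include <stdint.h>

/* Hypothetical width: 6 hardware ASID bits, as in the alpha example. */
#define ASID_BITS 6u
#define ASID_MASK ((1ull << ASID_BITS) - 1)

/* Per-CPU state: low bits = last ASID handed out, high bits = generation. */
struct cpu_asid_state {
    uint64_t last_ctx;
    unsigned flushes;    /* counts full flushes, for illustration only */
};

/* Per-mm context as seen by this CPU; 0 means "never allocated". */
struct mm_ctx {
    uint64_t ctx;
};

static void flush_whole_tlb(struct cpu_asid_state *cpu)
{
    cpu->flushes++;      /* stand-in for a real local TLB flush */
}

/* Return the hardware ASID to load when switching to mm on this CPU. */
static unsigned switch_to_mm(struct cpu_asid_state *cpu, struct mm_ctx *mm)
{
    /* Generation (high bits) differs: the stored ASID is stale. */
    if ((mm->ctx ^ cpu->last_ctx) & ~ASID_MASK) {
        uint64_t next = cpu->last_ctx + 1;

        /* ASID field wrapped to 0: the carry bumped the generation. */
        if ((next & ASID_MASK) == 0)
            flush_whole_tlb(cpu);

        cpu->last_ctx = next;
        mm->ctx = next;
    }
    return (unsigned)(mm->ctx & ASID_MASK);
}
```

Starting the CPU at generation 1 (last_ctx = 1 << ASID_BITS) makes a
zero-initialized mm context look stale on first use, so no separate
"unallocated" flag is needed.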

It's not obvious what alpha does, because we end up hiding the
per-thread ASN in the "struct pcb_struct" (in 'struct thread_info')
which is part of the alpha PALcode interface. But it seemed to work and
is fairly simple.

I think something very similar should work with Intel PCIDs.

                        Linus

* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
  2015-04-28 23:16       ` Linus Torvalds
@ 2015-04-28 23:23         ` Andy Lutomirski
  -1 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2015-04-28 23:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Kirill A. Shutemov, Dave Hansen, Andrew Morton,
	Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	X86 ML

On Tue, Apr 28, 2015 at 4:16 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 28, 2015 at 3:54 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> I had a totally different implementation idea in mind.  It goes
>> something like this:
>>
>> For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7.  We have
>> a per-cpu array of the mm [1] that owns each PCID. [...]
>
> We've done this before on other architectures.  See for example alpha.
> Look up "__get_new_mm_context()" and friends. I think sparc does the
> same (and I think sparc copied a lot of it from the alpha
> implementation).
>
> Iirc, the alpha version just generates a (per-cpu) asid one at a time,
> and has a generation counter so that when you run out of ASID's you do
> a global TLB invalidate on that CPU and start from 0 again. Actually,
> I think the generation number is just the high bits of the asid
> counter (alpha calls them "asn", intel calls them "pcid", and I tend
> to prefer "asid", but it's all the same thing).
>
> Then each thread just has a per-thread ASID. We don't try to make that
> be per-thread and per-cpu, but instead just force a new allocation
> when a thread moves to another CPU.

Alpha appears to have a per-thread per-cpu id of some sort:

/* The alpha MMU context is one "unsigned long" bitmap per CPU */
typedef unsigned long mm_context_t[NR_CPUS];

I think we can do it without that by keeping the mapping in reverse as
I sort of outlined -- for each cpu, store a mapping from mm to pcid.
When things fall out of the list, no big deal.
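
A rough sketch of the reverse mapping outlined above: each CPU keeps a
small table of which mm owns which PCID, and a switch searches it.
The table size, the names and the round-robin eviction are assumptions
made for illustration; the thread does not specify an eviction policy.

```c
#include <stdint.h>

#define NR_PCID_SLOTS 8   /* hypothetical; matches the "0-7" example */

/* Stand-in for the kernel's struct mm_struct. */
struct mm {
    int unused;
};

/* One instance per CPU: slot i records which mm owns PCID i. */
struct cpu_pcid_map {
    const struct mm *owner[NR_PCID_SLOTS];
    unsigned next_victim;   /* round-robin eviction cursor */
};

/*
 * Find or assign a PCID for mm on this CPU. *need_flush is set to 1
 * when the PCID was recycled, so its old TLB entries must be flushed
 * before it is used for the new mm.
 */
static unsigned pcid_for_mm(struct cpu_pcid_map *map, const struct mm *mm,
                            int *need_flush)
{
    unsigned i, victim;

    for (i = 0; i < NR_PCID_SLOTS; i++) {
        if (map->owner[i] == mm) {
            *need_flush = 0;    /* still cached: keep the warm TLB */
            return i;
        }
    }

    /* mm fell out of the list (or was never in it): take over a slot. */
    victim = map->next_victim;
    map->next_victim = (victim + 1) % NR_PCID_SLOTS;
    map->owner[victim] = mm;
    *need_flush = 1;
    return victim;
}
```

An mm that has been evicted simply looks like a new one the next time
it runs on that CPU, which is the "no big deal" case.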

--Andy

* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
  2015-04-28 23:23         ` Andy Lutomirski
@ 2015-04-28 23:38           ` Linus Torvalds
  -1 siblings, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2015-04-28 23:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Rik van Riel, Kirill A. Shutemov, Dave Hansen, Andrew Morton,
	Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	X86 ML

On Tue, Apr 28, 2015 at 4:23 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> I think we can do it without that by keeping the mapping in reverse as
> I sort of outlined -- for each cpu, store a mapping from mm to pcid.
> When things fall out of the list, no big deal.

So if you do it by just having a per-cpu array (of, say, 64 entries),
you now end up having to search that every time you do a task switch
to find the asid for the mm. And even then you've limited yourself to
just six bits, because searching an array sized for a full 12-bit asid
(4096 entries) would not be practical.

It's actually much simpler if you just do it the other way.

But hey, maybe you do something clever and can figure out a good way
to do it. I'm just saying that we *have* done this before on other
architectures, and it has worked. I think ARM has another asid
implementation in arch/arm/mm/context.c. I really think it would be a
good idea to copy some existing case rather than make up a new one.
It's not like ASIDs are unusual. It's arguably x86 that was unusual
in _not_ having them.

                     Linus

* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
  2015-04-28 23:38           ` Linus Torvalds
@ 2015-04-28 23:49             ` Andy Lutomirski
  -1 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2015-04-28 23:49 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Rik van Riel, Kirill A. Shutemov, Dave Hansen, Andrew Morton,
	Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	X86 ML

On Tue, Apr 28, 2015 at 4:38 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 28, 2015 at 4:23 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> I think we can do it without that by keeping the mapping in reverse as
>> I sort of outlined -- for each cpu, store a mapping from mm to pcid.
>> When things fall out of the list, no big deal.
>
> So you do it by just having a per-cpu array of (say, 64 entries), you
> now end up having to search that every time you do a task switch to
> find the asid for the mm. And even then you've limited yourself to
> just six bits, because doing the same for a possible full 12-bit asid
> would not be possible.
>
> It's actually much simpler if you just do it the other way.

I'm unconvinced.  I doubt that trying to keep more than 4-8 PCIDs
alive in a cpu's TLB is ever a win.  After all, the TLB isn't that
big, and, if we're only the 7th most recent mm to have been loaded on
a cpu, I doubt any of our TLB entries are still likely to be there.

Given that, even if we need 16 bytes of generation counter and such
per entry in the per-cpu array, that's at most 128 bytes for 8 PCIDs.
In practice, we really ought to be able to get it down to closer to
8 bytes per entry with some care, or we could use only 4 PCIDs, at
which point the whole per-cpu structure fits in a single cache line.
We can search it with 4-8 branches and no additional L1 misses.

Sure, with 64 entries this would be expensive, but I think that's excessive.
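
The size estimates above can be checked with a few compile-time
assertions. The slot layouts and the 64-byte cache line are assumptions
for illustration, not a proposed kernel structure.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* typical x86 line size; an assumption */

/* Hypothetical 16-byte slot: owning mm plus a generation counter. */
struct pcid_slot {
    uint64_t mm_id;
    uint64_t gen;
};

/* Hypothetical compact slot: 8 bytes "with some care". */
struct pcid_slot_small {
    uint32_t mm_id;
    uint32_t gen;
};

/* Total size of a per-cpu table of nr_pcids slots. */
static size_t table_bytes(size_t nr_pcids, size_t slot_bytes)
{
    return nr_pcids * slot_bytes;
}

/* 8 PCIDs x 16 bytes = 128 bytes, i.e. two cache lines. */
_Static_assert(8 * sizeof(struct pcid_slot) == 128, "8 full slots");
/* 4 PCIDs x 16 bytes = 64 bytes, i.e. exactly one cache line. */
_Static_assert(4 * sizeof(struct pcid_slot) == CACHE_LINE_BYTES,
               "4 full slots fit one line");
/* 8 PCIDs x 8 bytes = 64 bytes, also one cache line. */
_Static_assert(8 * sizeof(struct pcid_slot_small) == CACHE_LINE_BYTES,
               "8 compact slots fit one line");
```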

Also, this approach keeps the cost of blowing away stale PCIDs when we
need to invalidate a TLB entry on an inactive PCID down to a single
write as opposed to digging through the per-mm array to poke at the
state for each cpu it might be cached in.  But maybe I missed some
trick that avoids needing to do that.
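
One possible reading of the "single write" is a per-mm generation
counter that each per-cpu slot compares against at switch time. The
sketch below assumes that reading; all names are hypothetical, and a
real kernel would need atomic accesses and would still have to deal
with CPUs that are actively running the mm.

```c
#include <stdint.h>

struct mm_state {
    uint64_t id;        /* hypothetical unique id, never reused, nonzero */
    uint64_t tlb_gen;   /* bumped whenever cached translations go stale */
};

/* One per-cpu slot for one PCID; mm_id == 0 means the slot is free. */
struct pcid_slot {
    uint64_t mm_id;
    uint64_t seen_gen;  /* mm's tlb_gen when this PCID was last flushed */
};

/* The single write: every inactive cached copy becomes stale at once. */
static void mm_invalidate(struct mm_state *mm)
{
    mm->tlb_gen++;
}

/*
 * Called at switch-in. Returns 1 if the PCID backing this slot must be
 * flushed before use, and records that the slot is now up to date.
 */
static int slot_needs_flush(struct pcid_slot *slot, const struct mm_state *mm)
{
    if (slot->mm_id == mm->id && slot->seen_gen == mm->tlb_gen)
        return 0;

    slot->mm_id = mm->id;
    slot->seen_gen = mm->tlb_gen;
    return 1;
}
```

With this layout the invalidating CPU never touches per-cpu state; each
CPU notices the stale generation the next time it switches to the mm.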

--Andy

* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
  2015-04-28 22:15 ` Kirill A. Shutemov
@ 2015-04-28 22:56   ` Linus Torvalds
  -1 siblings, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2015-04-28 22:56 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Dave Hansen, Andrew Morton, Mel Gorman,
	Rik van Riel, Linux Kernel Mailing List, linux-mm,
	the arch/x86 maintainers

On Tue, Apr 28, 2015 at 3:15 PM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
>
> I talked with Dave about implementing PCID and he thinks that it will be
> net loss.

So I'm told that Suresh Siddha actually had a patch inside Intel to
use PCID (back when he worked for Intel, I think he left), and that it
was a wash in their testing.

I never saw the patch, and it might be interesting to try it again,
but there is some reason to believe that it doesn't make much of a
difference. Unlike most of the traditional RISC machines that got big
speedups, Intel TLB walking is so good that it likely isn't nearly as
noticeable, and it likely *does* result in more IPIs etc. Possibly
not a lot more, but if the win isn't big...

So I don't want to discourage you, because I'd love to see what the
patch looks like and if we can find cases where it matters, but I do
want to set expectations right. It's unlikely to be a big issue.

                           Linus
