* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
@ 2015-04-28 22:54 ` Andy Lutomirski
0 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2015-04-28 22:54 UTC (permalink / raw)
To: Rik van Riel
Cc: Kirill A. Shutemov, Dave Hansen, Linus Torvalds, Andrew Morton,
Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
X86 ML
On Tue, Apr 28, 2015 at 3:41 PM, Rik van Riel <riel@redhat.com> wrote:
> On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote:
>> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote:
>>> At some point, I'd like to implement PCID on x86 (if no one beats me
>>> to it, and this is a low priority for me), which will allow us to skip
>>> expensive TLB flushes while context switching. I have no idea whether
>>> ARM can do something similar.
>>
>> I talked with Dave about implementing PCID and he thinks that it will be
>> a net loss. TLB entries will live longer and it means we would need to trigger
>> more IPIs to flush them out when we have to. Cost of IPIs will be higher
>> than benefit from hot TLB after context switch.
>
> I suspect that may depend on how you do the shootdown.
>
> If, when receiving a TLB shootdown for a non-current PCID, we just flush
> all the entries for that PCID and remove the CPU from the mm's
> cpu_vm_mask_var, we will never receive more than one shootdown IPI for
> a non-current mm, but we will still get the benefits of TLB longevity
> when dealing with eg. pipe workloads where tasks take turns running on
> the same CPU.

I had a totally different implementation idea in mind. It goes
something like this:

For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have
a per-cpu array of the mm [1] that owns each PCID. On context switch,
we look up the new mm in the array and, if there's a PCID mapped, we
switch cr3 and select that PCID. If there is no PCID mapped, we
choose one (LRU? clock replacement?), switch cr3 and select and
invalidate that PCID.
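
As a rough illustration of the per-cpu table described above, here is a minimal userspace sketch in C. The names `pcid_table`, `pcid_choice` and `choose_pcid`, and the use of a nonzero 64-bit `mm_id` to identify an mm, are hypothetical. Plain round-robin stands in for the LRU or clock policy that the text leaves open, and the actual cr3 write is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_PCID_SLOTS 8            /* "e.g. 0-7" from the text above */

/* Hypothetical per-CPU state: which mm (by unique id) owns each PCID. */
struct pcid_table {
    uint64_t owner[NR_PCID_SLOTS]; /* 0 means "slot unused" */
    unsigned int next_victim;      /* round-robin replacement cursor */
};

struct pcid_choice {
    unsigned int pcid;             /* PCID to select when switching cr3 */
    bool need_flush;               /* true if this PCID must be invalidated */
};

/* Called on context switch to the mm identified by mm_id (must be nonzero). */
static struct pcid_choice choose_pcid(struct pcid_table *t, uint64_t mm_id)
{
    struct pcid_choice c;
    unsigned int i;

    for (i = 0; i < NR_PCID_SLOTS; i++) {
        if (t->owner[i] == mm_id) {
            c.pcid = i;
            c.need_flush = false;  /* hit: keep the cached translations */
            return c;
        }
    }

    /* Miss: recycle a slot and invalidate whatever it used to tag. */
    c.pcid = t->next_victim;
    c.need_flush = true;
    t->owner[c.pcid] = mm_id;
    t->next_victim = (t->next_victim + 1) % NR_PCID_SLOTS;
    return c;
}
```

A caller would load cr3 with the returned PCID and, when `need_flush` is set, invalidate that PCID as part of the switch.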

When it's time to invalidate a TLB entry on an mm that's active
remotely, we really don't want to send an IPI to a CPU that doesn't
actually have that mm active. Instead we bump some kind of generation
counter in the mm_struct that will cause the next switch to that mm
not to match the PCID list. To keep this working, I think we also
need to update the per-cpu PCID list with our generation counter
either when we context switch out or when we process a TLB shootdown
IPI.

This could be a bit tricky to get right, but I think it can be done
without adding more than a cacheline or two to the context switch
overhead and without any extra IPIs at all.
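
The generation-counter idea can be sketched in userspace C as follows. This is only a model under stated assumptions: `mm_ctx`, `pcid_slot`, `cpu_tlb_state`, `defer_flush`, `slot_is_current` and `mark_synced` are hypothetical names, and a real kernel version would need atomic updates and memory barriers that are omitted here.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_PCID_SLOTS 4

/* Hypothetical stand-in for the fields an mm_struct would need. */
struct mm_ctx {
    uint64_t id;        /* unique and never reused, so a recycled mm cannot match */
    uint64_t tlb_gen;   /* bumped instead of sending an IPI to an inactive CPU */
};

/* Hypothetical per-CPU record of what each PCID's TLB entries reflect. */
struct pcid_slot {
    uint64_t mm_id;
    uint64_t seen_gen;  /* mm's tlb_gen when this CPU was last in sync */
};

struct cpu_tlb_state {
    struct pcid_slot slot[NR_PCID_SLOTS];
};

/* Invalidate for CPUs where the mm is not running: one write, no IPI. */
static void defer_flush(struct mm_ctx *mm)
{
    mm->tlb_gen++;      /* model only: not atomic */
}

/* On switch-in: can this PCID be reused without flushing it first? */
static bool slot_is_current(const struct cpu_tlb_state *cpu, unsigned int pcid,
                            const struct mm_ctx *mm)
{
    const struct pcid_slot *s = &cpu->slot[pcid];

    return s->mm_id == mm->id && s->seen_gen == mm->tlb_gen;
}

/*
 * Record that this CPU's entries for the PCID match the mm's current
 * generation: after flushing the PCID, on context switch out, or after
 * handling a shootdown IPI while the mm is running.
 */
static void mark_synced(struct cpu_tlb_state *cpu, unsigned int pcid,
                        const struct mm_ctx *mm)
{
    cpu->slot[pcid].mm_id = mm->id;
    cpu->slot[pcid].seen_gen = mm->tlb_gen;
}
```

When `slot_is_current()` returns false on switch-in, the CPU would flush that PCID and then call `mark_synced()`, so a stale slot costs one local flush rather than a remote interrupt.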

[1] It shouldn't be just an mm_struct pointer, because then we have to
invalidate it somehow when we recycle an mm_struct. Maybe we'd use
some kind of counter. We also need a TLB shootdown generation
counter of some sort as described.

--Andy
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
2015-04-28 22:54 ` Andy Lutomirski
@ 2015-04-28 22:56 ` Rik van Riel
-1 siblings, 0 replies; 24+ messages in thread
From: Rik van Riel @ 2015-04-28 22:56 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Kirill A. Shutemov, Dave Hansen, Linus Torvalds, Andrew Morton,
Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
X86 ML
On 04/28/2015 06:54 PM, Andy Lutomirski wrote:
> On Tue, Apr 28, 2015 at 3:41 PM, Rik van Riel <riel@redhat.com> wrote:
>> On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote:
>>> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote:
>>>> At some point, I'd like to implement PCID on x86 (if no one beats me
>>>> to it, and this is a low priority for me), which will allow us to skip
>>>> expensive TLB flushes while context switching. I have no idea whether
>>>> ARM can do something similar.
>>>
>>> I talked with Dave about implementing PCID and he thinks that it will be
>>> a net loss. TLB entries will live longer and it means we would need to trigger
>>> more IPIs to flush them out when we have to. Cost of IPIs will be higher
>>> than benefit from hot TLB after context switch.
>>
>> I suspect that may depend on how you do the shootdown.
>>
>> If, when receiving a TLB shootdown for a non-current PCID, we just flush
>> all the entries for that PCID and remove the CPU from the mm's
>> cpu_vm_mask_var, we will never receive more than one shootdown IPI for
>> a non-current mm, but we will still get the benefits of TLB longevity
>> when dealing with eg. pipe workloads where tasks take turns running on
>> the same CPU.
>
> I had a totally different implementation idea in mind. It goes
> something like this:
>
> For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have
> a per-cpu array of the mm [1] that owns each PCID. On context switch,
> we look up the new mm in the array and, if there's a PCID mapped, we
> switch cr3 and select that PCID. If there is no PCID mapped, we
> choose one (LRU? clock replacement?), switch cr3 and select and
> invalidate that PCID.
>
> When it's time to invalidate a TLB entry on an mm that's active
> remotely, we really don't want to send an IPI to a CPU that doesn't
> actually have that mm active. Instead we bump some kind of generation
> counter in the mm_struct that will cause the next switch to that mm
> not to match the PCID list. To keep this working, I think we also
> need to update the per-cpu PCID list with our generation counter
> either when we context switch out or when we process a TLB shootdown
> IPI.

If we do that, we can also get rid of TLB shootdowns for
idle CPUs in lazy TLB mode.

Very nice, if the details work out.

--
All rights reversed
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
2015-04-28 22:56 ` Rik van Riel
@ 2015-04-28 23:01 ` Andy Lutomirski
-1 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2015-04-28 23:01 UTC (permalink / raw)
To: Rik van Riel
Cc: Kirill A. Shutemov, Dave Hansen, Linus Torvalds, Andrew Morton,
Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
X86 ML
On Tue, Apr 28, 2015 at 3:56 PM, Rik van Riel <riel@redhat.com> wrote:
> On 04/28/2015 06:54 PM, Andy Lutomirski wrote:
>> On Tue, Apr 28, 2015 at 3:41 PM, Rik van Riel <riel@redhat.com> wrote:
>>> On 04/28/2015 06:15 PM, Kirill A. Shutemov wrote:
>>>> On Tue, Apr 28, 2015 at 01:42:10PM -0700, Andy Lutomirski wrote:
>>>>> At some point, I'd like to implement PCID on x86 (if no one beats me
>>>>> to it, and this is a low priority for me), which will allow us to skip
>>>>> expensive TLB flushes while context switching. I have no idea whether
>>>>> ARM can do something similar.
>>>>
>>>> I talked with Dave about implementing PCID and he thinks that it will be
>>>> a net loss. TLB entries will live longer and it means we would need to trigger
>>>> more IPIs to flush them out when we have to. Cost of IPIs will be higher
>>>> than benefit from hot TLB after context switch.
>>>
>>> I suspect that may depend on how you do the shootdown.
>>>
>>> If, when receiving a TLB shootdown for a non-current PCID, we just flush
>>> all the entries for that PCID and remove the CPU from the mm's
>>> cpu_vm_mask_var, we will never receive more than one shootdown IPI for
>>> a non-current mm, but we will still get the benefits of TLB longevity
>>> when dealing with eg. pipe workloads where tasks take turns running on
>>> the same CPU.
>>
>> I had a totally different implementation idea in mind. It goes
>> something like this:
>>
>> For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have
>> a per-cpu array of the mm [1] that owns each PCID. On context switch,
>> we look up the new mm in the array and, if there's a PCID mapped, we
>> switch cr3 and select that PCID. If there is no PCID mapped, we
>> choose one (LRU? clock replacement?), switch cr3 and select and
>> invalidate that PCID.
>>
>> When it's time to invalidate a TLB entry on an mm that's active
>> remotely, we really don't want to send an IPI to a CPU that doesn't
>> actually have that mm active. Instead we bump some kind of generation
>> counter in the mm_struct that will cause the next switch to that mm
>> not to match the PCID list. To keep this working, I think we also
>> need to update the per-cpu PCID list with our generation counter
>> either when we context switch out or when we process a TLB shootdown
>> IPI.
>
> If we do that, we can also get rid of TLB shootdowns for
> idle CPUs in lazy TLB mode.
>
> Very nice, if the details work out.
>

I wonder if we could treat the non-PCID case just like the PCID case
but with only one PCID. Maybe get rid of the mm vs active_mm
distinction. Maybe not, though -- if nothing else, we still need to
kick our pgd out from idle or kthread CPUs before we free it.

The reason I thought of PCIDs this way is that 12 bits isn't nearly
enough to get away with allocating each mm its own PCID. Rather than
trying to shoehorn them in, it seemed like a better approach would be
to only use a very small number, since keeping around TLB entries that
are more than a few context switches old seems mostly useless.

--Andy
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
2015-04-28 23:01 ` Andy Lutomirski
@ 2015-04-28 23:19 ` Linus Torvalds
-1 siblings, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2015-04-28 23:19 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Rik van Riel, Kirill A. Shutemov, Dave Hansen, Andrew Morton,
Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
X86 ML
On Tue, Apr 28, 2015 at 4:01 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> The reason I thought of PCIDs this way is that 12 bits isn't nearly
> enough to get away with allocating each mm its own PCID.

Not even close. And really, we've already done this for other
architectures. On alpha, the number of bits in the pcid is
model-specific, but it was something like 6 for the ones I used.
That's plenty.

Also, I don't think Intel actually does 12 bits of pcid. What they do
is to hash the 12 bits down to something smaller (like two or three
bits in the actual TLB data structure), and then the CPU basically
invalidates any pcid's that alias (they have a small 4- or 8-entry array
saying that "this hash was used for this 12-bit pcid").

So there's actually *another* level of dynamic mapping going on below
the software interface.

Linus
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
2015-04-28 22:54 ` Andy Lutomirski
@ 2015-04-28 23:16 ` Linus Torvalds
-1 siblings, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2015-04-28 23:16 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Rik van Riel, Kirill A. Shutemov, Dave Hansen, Andrew Morton,
Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
X86 ML
On Tue, Apr 28, 2015 at 3:54 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> I had a totally different implementation idea in mind. It goes
> something like this:
>
> For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have
> a per-cpu array of the mm [1] that owns each PCID. [...]

We've done this before on other architectures. See for example alpha.
Look up "__get_new_mm_context()" and friends. I think sparc does the
same (and I think sparc copied a lot of it from the alpha
implementation).

Iirc, the alpha version just generates a (per-cpu) asid one at a time,
and has a generation counter so that when you run out of ASID's you do
a global TLB invalidate on that CPU and start from 0 again. Actually,
I think the generation number is just the high bits of the asid
counter (alpha calls them "asn", intel calls them "pcid", and I tend
to prefer "asid", but it's all the same thing).

Then each thread just has a per-thread ASID. We don't try to make that
be per-thread and per-cpu, but instead just force a new allocation
when a thread moves to another CPU.
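
A minimal single-CPU model of that scheme, written as a userspace C sketch. It is not the alpha code: the names `cpu_asid_state`, `new_context` and `switch_to` are hypothetical, and 6 ASID bits is just the figure recalled above. The generation lives in the high bits of one counter, so running out of ASIDs shows up as a carry into the generation.

```c
#include <stdint.h>

#define ASID_BITS 6                          /* assumption: "something like 6" */
#define ASID_MASK ((1UL << ASID_BITS) - 1)
#define GEN_STEP  (1UL << ASID_BITS)         /* adding this bumps the generation */

/* Hypothetical per-CPU allocator state. */
struct cpu_asid_state {
    unsigned long last;     /* generation in the high bits, ASID in the low bits */
    unsigned long flushes;  /* model only: counts full local TLB invalidates */
};

/* Hand out the next value; when the ASID bits wrap, flush and start at 0. */
static unsigned long new_context(struct cpu_asid_state *cpu)
{
    unsigned long next = cpu->last + 1;

    if ((next & ASID_MASK) == 0)
        cpu->flushes++;     /* stand-in for invalidating the whole local TLB */

    cpu->last = next;
    return next;
}

/*
 * ctx is the saved per-thread value (0 = never allocated).
 * Returns the hardware ASID to load for this thread.
 */
static unsigned long switch_to(struct cpu_asid_state *cpu, unsigned long *ctx)
{
    /* A value from an older generation (or none at all) is stale. */
    if ((*ctx ^ cpu->last) & ~ASID_MASK)
        *ctx = new_context(cpu);

    return *ctx & ASID_MASK;
}
```

In this model `last` starts at `GEN_STEP` (generation 1), so a zeroed `ctx` never matches. A thread that moves to another CPU would have its `ctx` reset to 0, which forces the new allocation mentioned above.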

It's not obvious what alpha does, because we end up hiding the
per-thread ASN in the "struct pcb_struct" (in 'struct thread_info')
which is part of the alpha pal-code interface. But it seemed to work and
is fairly simple.

I think something very similar should work with intel pcid's.

Linus
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
2015-04-28 23:16 ` Linus Torvalds
@ 2015-04-28 23:23 ` Andy Lutomirski
-1 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2015-04-28 23:23 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rik van Riel, Kirill A. Shutemov, Dave Hansen, Andrew Morton,
Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
X86 ML
On Tue, Apr 28, 2015 at 4:16 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 28, 2015 at 3:54 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> I had a totally different implementation idea in mind. It goes
>> something like this:
>>
>> For each CPU, we allocate a fixed number of PCIDs, e.g. 0-7. We have
>> a per-cpu array of the mm [1] that owns each PCID. [...]
>
> We've done this before on other architectures. See for example alpha.
> Look up "__get_new_mm_context()" and friends. I think sparc does the
> same (and I think sparc copied a lot of it from the alpha
> implementation).
>
> Iirc, the alpha version just generates a (per-cpu) asid one at a time,
> and has a generation counter so that when you run out of ASID's you do
> a global TLB invalidate on that CPU and start from 0 again. Actually,
> I think the generation number is just the high bits of the asid
> counter (alpha calls them "asn", intel calls them "pcid", and I tend
> to prefer "asid", but it's all the same thing).
>
> Then each thread just has a per-thread ASID. We don't try to make that
> be per-thread and per-cpu, but instead just force a new allocation
> when a thread moves to another CPU.

Alpha appears to have a per-thread per-cpu id of some sort:

    /* The alpha MMU context is one "unsigned long" bitmap per CPU */
    typedef unsigned long mm_context_t[NR_CPUS];

I think we can do it without that by keeping the mapping in reverse as
I sort of outlined -- for each cpu, store a mapping from mm to pcid.
When things fall out of the list, no big deal.

--Andy
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
2015-04-28 23:23 ` Andy Lutomirski
@ 2015-04-28 23:38 ` Linus Torvalds
-1 siblings, 0 replies; 24+ messages in thread
From: Linus Torvalds @ 2015-04-28 23:38 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Rik van Riel, Kirill A. Shutemov, Dave Hansen, Andrew Morton,
Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
X86 ML
On Tue, Apr 28, 2015 at 4:23 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> I think we can do it without that by keeping the mapping in reverse as
> I sort of outlined -- for each cpu, store a mapping from mm to pcid.
> When things fall out of the list, no big deal.

So if you do it by just having a per-cpu array (of, say, 64 entries), you
now end up having to search that every time you do a task switch to
find the asid for the mm. And even then you've limited yourself to
just six bits, because doing the same for a possible full 12-bit asid
would not be possible.

It's actually much simpler if you just do it the other way.

But hey, maybe you do something clever and can figure out a good way
to do it. I'm just saying that we *have* done this before on other
architectures, and it has worked. I think ARM has another asid
implementation in arch/arm/mm/context.c. I really think it would be a
good idea to copy some existing case rather than make up a new one.
It's not like asid's are unusual. It's arguably x86 that was unusual
in _not_ having them.

Linus
^ permalink raw reply [flat|nested] 24+ messages in thread
* Re: PCID and TLB flushes (was: [GIT PULL] kdbus for 4.1-rc1)
2015-04-28 23:38 ` Linus Torvalds
@ 2015-04-28 23:49 ` Andy Lutomirski
-1 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2015-04-28 23:49 UTC (permalink / raw)
To: Linus Torvalds
Cc: Rik van Riel, Kirill A. Shutemov, Dave Hansen, Andrew Morton,
Mel Gorman, linux-kernel@vger.kernel.org, linux-mm@kvack.org,
X86 ML
On Tue, Apr 28, 2015 at 4:38 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Apr 28, 2015 at 4:23 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> I think we can do it without that by keeping the mapping in reverse as
>> I sort of outlined -- for each cpu, store a mapping from mm to pcid.
>> When things fall out of the list, no big deal.
>
> So you do it by just having a per-cpu array of (say, 64 entries), you
> now end up having to search that every time you do a task switch to
> find the asid for the mm. And even then you've limited yourself to
> just six bits, because doing the same for a possible full 12-bit asid
> would not be possible.
>
> It's actually much simpler if you just do it the other way.

I'm unconvinced. I doubt that trying to keep more than 4-8 PCIDs
alive in a cpu's TLB is ever a win. After all, the TLB isn't that
big, and, if we're only the 7th most recent mm to have been loaded on
a cpu, I doubt any of our TLB entries are still likely to be there.

Given that, even if we need 16 bytes of generation counter and such
per entry in the per-cpu array, that's at most 128 bytes for 8 PCIDs.
In practice, we really ought to be able to get it down to closer to
8 bytes per entry with some care, or we could use only 4 PCIDs, at
which point the whole per-cpu structure fits in a single cache line.
We can search it with 4-8 branches and no additional L1 misses.
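
The size arithmetic can be checked with a small C sketch. The layout is an assumption made for illustration: `pcid_slot`, `pcid_table` and `find_pcid` are hypothetical names, each entry is taken to be an 8-byte mm identifier plus an 8-byte generation, and 64 bytes is assumed as the cache line size.

```c
#include <stdint.h>

#define CACHE_LINE_BYTES 64   /* assumption: typical x86 cache line */
#define NR_PCID_SLOTS    4

/* Hypothetical 16-byte entry: which mm owns the PCID, and how fresh it is. */
struct pcid_slot {
    uint64_t mm_id;
    uint64_t tlb_gen;
};

struct pcid_table {
    struct pcid_slot slot[NR_PCID_SLOTS];
};

/* 4 entries x 16 bytes = 64 bytes; 8 entries would be 128 bytes. */
_Static_assert(sizeof(struct pcid_slot) == 16, "entry should be 16 bytes");
_Static_assert(sizeof(struct pcid_table) <= CACHE_LINE_BYTES,
               "table should fit in one cache line");

/* Linear search: at most NR_PCID_SLOTS compares. Returns the PCID or -1. */
static int find_pcid(const struct pcid_table *t, uint64_t mm_id)
{
    for (int i = 0; i < NR_PCID_SLOTS; i++)
        if (t->slot[i].mm_id == mm_id)
            return i;
    return -1;
}
```

To really occupy a single line, the table would also have to be aligned to a cache line boundary, which this sketch does not enforce.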

Sure, with 64 entries this would be expensive, but I think that's excessive.

Also, this approach keeps the cost of blowing away stale PCIDs when we
need to invalidate a TLB entry on an inactive PCID down to a single
write as opposed to digging through the per-mm array to poke at the
state for each cpu it might be cached in. But maybe I missed some
trick that avoids needing to do that.

--Andy
^ permalink raw reply [flat|nested] 24+ messages in thread