* [RFC PATCH 0/2] Fix a couple of issues with zap_pte_range and MMU gather
@ 2014-10-28 11:44 Will Deacon
2014-10-28 11:44 ` [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure Will Deacon
2014-10-28 11:44 ` [RFC PATCH 2/2] zap_pte_range: fix partial TLB flushing in response to a dirty pte Will Deacon
0 siblings, 2 replies; 27+ messages in thread
From: Will Deacon @ 2014-10-28 11:44 UTC (permalink / raw)
To: torvalds, peterz; +Cc: linux-kernel, linux, benh, Will Deacon
Hi all,
This patch series attempts to fix a couple of issues I've noticed with
zap_pte_range and the MMU gather code on arm64.
The first fix resolves a TLB range truncation, which I found by code
inspection (this is on the batch failure path, which doesn't appear to
be regularly exercised on my system).
For the second fix, I'd really appreciate some comments. The problem is
that the architecture TLB batching implementation may update the start
and end fields of the gather structure, so that they actually cover only
a subset of the initial range set up by tlb_gather_mmu (based on calls
to tlb_remove_tlb_entry). In the force_flush case, zap_pte_range sets
these fields directly, which can result in a negative range if the
architecture has also updated the end address. The patch here uses
min(end, addr) as the end of the first range, which creates a second
range from that address to the end of the region. This results in a
potential over-invalidation on arm64, but I can't think of anything
better without updating (at least) the x86 tlb.h implementation.
Ideally, we'd let the architecture set start/end during the call to
tlb_flush_mmu_tlbonly (arm64 does this already in tlb_flush).
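To make the negative range concrete, here is a minimal user-space sketch of the interaction described above. It is not kernel code: `struct sketch_gather`, the `sketch_*` helpers and both `SKETCH_*` constants are hypothetical stand-ins, and the sequence in `sketch_force_flush` is only an approximation of what the force_flush path does with start/end.

```c
#include <limits.h>
#include <stdbool.h>

#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_TASK_SIZE ULONG_MAX	/* placeholder, not the real TASK_SIZE */

/* Hypothetical, cut-down stand-in for the start/end fields of mmu_gather. */
struct sketch_gather {
	unsigned long start;
	unsigned long end;	/* exclusive */
};

/* Arch-style tracking: grow the range to cover one more unmapped page. */
void sketch_track(struct sketch_gather *tlb, unsigned long addr)
{
	if (addr < tlb->start)
		tlb->start = addr;
	if (addr + SKETCH_PAGE_SIZE > tlb->end)
		tlb->end = addr + SKETCH_PAGE_SIZE;
}

/* Arch-style flush: after flushing, fall back to the empty range. */
void sketch_arch_flush(struct sketch_gather *tlb)
{
	tlb->start = SKETCH_TASK_SIZE;
	tlb->end = 0;
}

/*
 * Approximation of the force_flush path: flush up to addr, then claim
 * that [addr, old_end) is still pending. Returns true when the pending
 * range ends up inverted (start > end).
 */
bool sketch_force_flush(struct sketch_gather *tlb, unsigned long addr)
{
	unsigned long old_end = tlb->end;	/* already shrunk by the arch */

	tlb->end = addr;
	sketch_arch_flush(tlb);
	tlb->start = addr;
	tlb->end = old_end;
	return tlb->start > tlb->end;
}
```

If the tracked pages end at 0x14000 but the pte loop has walked on to 0x20000 (for example because the trailing ptes were not present), the pending range becomes 0x20000..0x14000, which is the negative range mentioned above.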
Thoughts?
Will
Will Deacon (2):
zap_pte_range: update addr when forcing flush after TLB batching
faiure
zap_pte_range: fix partial TLB flushing in response to a dirty pte
mm/memory.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
--
2.1.1
^ permalink raw reply [flat|nested] 27+ messages in thread
* [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-10-28 11:44 [RFC PATCH 0/2] Fix a couple of issues with zap_pte_range and MMU gather Will Deacon
@ 2014-10-28 11:44 ` Will Deacon
2014-10-28 15:30 ` Linus Torvalds
2014-10-28 11:44 ` [RFC PATCH 2/2] zap_pte_range: fix partial TLB flushing in response to a dirty pte Will Deacon
1 sibling, 1 reply; 27+ messages in thread
From: Will Deacon @ 2014-10-28 11:44 UTC (permalink / raw)
To: torvalds, peterz; +Cc: linux-kernel, linux, benh, Will Deacon

When unmapping a range of pages in zap_pte_range, the page being
unmapped is added to an mmu_gather_batch structure for asynchronous
freeing. If we run out of space in the batch structure before the range
has been completely unmapped, then we break out of the loop, force a
TLB flush and free the pages that we have batched so far. If there are
further pages to unmap, then we resume the loop where we left off.

Unfortunately, we forget to update addr when we break out of the loop,
which causes us to truncate the range being invalidated as the end
address is exclusive. When we re-enter the loop at the same address, the
page has already been freed and the pte_present test will fail, meaning
that we do not reconsider the address for invalidation.

This patch fixes the problem by incrementing addr by the PAGE_SIZE
before breaking out of the loop on batch failure.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 mm/memory.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memory.c b/mm/memory.c
index 1cc6bfbd872e..3e503831e042 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1147,6 +1147,7 @@ again:
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(!__tlb_remove_page(tlb, page))) {
 				force_flush = 1;
+				addr += PAGE_SIZE;
 				break;
 			}
 			continue;
-- 
2.1.1
^ permalink raw reply related [flat|nested] 27+ messages in thread
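The truncation described in the commit message above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `sketch_zap`, `sketch_covers`, `struct sketch_range` and the batch `capacity` parameter are hypothetical, and the loop is a heavily simplified stand-in for the pte loop in zap_pte_range.

```c
#define SKETCH_PAGE_SIZE 4096UL

/* Half-open range [start, end): the end address is exclusive. */
struct sketch_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Walk pages from start towards end, "unmapping" each one and adding it
 * to a batch that holds at most `capacity` pages. When the batch fills,
 * break out and return the range that would be handed to the TLB flush.
 * With apply_fix set, addr is advanced past the page just unmapped
 * before breaking, as the patch does.
 */
struct sketch_range sketch_zap(unsigned long start, unsigned long end,
			       unsigned int capacity, int apply_fix,
			       unsigned long *last_unmapped)
{
	unsigned long addr = start;
	unsigned int batched = 0;

	for (; addr != end; addr += SKETCH_PAGE_SIZE) {
		*last_unmapped = addr;		/* this page is now unmapped */
		if (++batched == capacity) {	/* batch is full */
			if (apply_fix)
				addr += SKETCH_PAGE_SIZE;
			break;
		}
	}
	return (struct sketch_range){ start, addr };
}

/* Does the half-open range r include addr? */
int sketch_covers(struct sketch_range r, unsigned long addr)
{
	return addr >= r.start && addr < r.end;
}
```

With a three-page batch starting at 0x1000, the unfixed variant returns a range ending at 0x3000, so the page unmapped at 0x3000 falls outside it; with the increment the range ends at 0x4000 and the page is covered.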
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-10-28 11:44 ` [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure Will Deacon
@ 2014-10-28 15:30 ` Linus Torvalds
2014-10-28 16:07 ` Will Deacon
2014-10-28 21:40 ` Linus Torvalds
0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2014-10-28 15:30 UTC (permalink / raw)
To: Will Deacon
Cc: Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> This patch fixes the problem by incrementing addr by the PAGE_SIZE
> before breaking out of the loop on batch failure.

This patch seems harmless and right, unlike the other one.

I'd be ok with changing the *generic* code to do the "update start/end
pointers in the mmu_gather structure", but then it really has to be
done in *generic* code. Move your arm64 start/end updates to
include/asm-generic/tlb.h, and change the initialization of start/end
entirely. Right now we initialize those things to the maximal range
(see tlb_gather_mmu()), and the arm64 logic seems to be to initialize
them to TASK_SIZE/0 respectively and then update start/end as you add
pages, so that you get the minimal range.

But because of this arm64 confusion (the "minimal range" really is
*not* how this is designed to work), the current existing
tlb_gather_mmu() does the wrong initialization.

In other words: my argument is that right now the arm64 code is just
*wrong*. I'd be ok with making it the right thing to do, but if so, it
needs to be made generic.

Are there any actual advantages to the whole "minimal range" model? It
adds overhead and complexity, and the only case where it would seem to
be worth it is for benchmarks that do mmap/munmap in a loop and then
only map a single page. Normal loads don't tend to have those kinds of
"minimal range is very different from whole range" cases. Do you have
a real case for why it does that minimal range thing?

Because if you don't have a real case, I'd really suggest you get rid
of all the arm64 games with start/end. That still leaves this one
patch the correct thing to do (because even without the start/end
games the "need_flush" flag goes with the range), but it makes the
second patch a complete non-issue.

If you *do* have a real case, I think you need to modify your second patch to:

 - move the arm start/end updates from tlb_flush/tlb_add_flush to asm-generic

 - make tlb_gather_mmu() initialize start/end to TASK_SIZE/0 the same
   way your tlb_flush does (so that the subsequent min/max games work).

so that *everybody* does the start/end games. I'd be ok with that.

What I'm not ok with is arm64 using the generic TLB gather way in odd
ways that then breaks code and results in things like your 2/2 patch
that fixes ARM64 but breaks x86.

Hmm? Is there something I'm missing?

Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
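The point about initialization can be shown with a short user-space sketch, assuming a min/max update in the style described for arm64. All names here (`struct sketch_range`, `sketch_init_maximal`, `sketch_init_empty`, `sketch_add_page`, the `SKETCH_*` constants) are hypothetical and only model the start/end arithmetic, not any kernel interface.

```c
#include <limits.h>

#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_TASK_SIZE ULONG_MAX	/* placeholder, not the real TASK_SIZE */

struct sketch_range {
	unsigned long start;
	unsigned long end;	/* exclusive */
};

/* "Maximal range" initialization: cover the whole region up front. */
void sketch_init_maximal(struct sketch_range *r,
			 unsigned long start, unsigned long end)
{
	r->start = start;
	r->end = end;
}

/* "Minimal range" initialization: start empty, as TASK_SIZE/0. */
void sketch_init_empty(struct sketch_range *r)
{
	r->start = SKETCH_TASK_SIZE;
	r->end = 0;
}

/* min/max update applied for every page that gets unmapped. */
void sketch_add_page(struct sketch_range *r, unsigned long addr)
{
	if (addr < r->start)
		r->start = addr;
	if (addr + SKETCH_PAGE_SIZE > r->end)
		r->end = addr + SKETCH_PAGE_SIZE;
}
```

Starting from the maximal range, the min/max update can never shrink anything, so a single page at 0x200000 inside 0x100000..0x900000 still leaves the whole region to flush. Starting from TASK_SIZE/0, the same update yields exactly 0x200000..0x201000.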
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-10-28 15:30 ` Linus Torvalds
@ 2014-10-28 16:07 ` Will Deacon
2014-10-28 16:25 ` Linus Torvalds
2014-10-28 21:40 ` Linus Torvalds
1 sibling, 1 reply; 27+ messages in thread
From: Will Deacon @ 2014-10-28 16:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt

Hi Linus,

On Tue, Oct 28, 2014 at 03:30:43PM +0000, Linus Torvalds wrote:
> On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote:
> > This patch fixes the problem by incrementing addr by the PAGE_SIZE
> > before breaking out of the loop on batch failure.
>
> This patch seems harmless and right, unlike the other one.

Yes; the other patch was more of a discussion point, as I was really
struggling to solve the problem without changing arch code. It was also
broken, as you discovered.

> I'd be ok with changing the *generic* code to do the "update start/end
> pointers in the mmu_gather structure", but then it really has to be
> done in *generic* code. Move your arm64 start/end updates to
> include/asm-generic/tlb.h, and change the initialization of start/end
> entirely. Right now we initialize those things to the maximal range
> (see tlb_gather_mmu()), and the arm64 logic seems to be to initialize
> them to TASK_SIZE/0 respectively and then update start/end as you add
> pages, so that you get the minimal range.
>
> But because of this arm64 confusion (the "minimal range" really is
> *not* how this is designed to work), the current existing
> tlb_gather_mmu() does the wrong initialization.
>
> In other words: my argument is that right now the arm64 code is just
> *wrong*. I'd be ok with making it the right thing to do, but if so, it
> needs to be made generic.

Ok, that's useful, thanks. Out of curiosity, what *is* the current intention
of __tlb_remove_tlb_entry, if start/end shouldn't be touched by
architectures? Is it just for the PPC hash thing?

> Are there any actual advantages to the whole "minimal range" model? It
> adds overhead and complexity, and the only case where it would seem to
> be worth it is for benchmarks that do mmap/munmap in a loop and then
> only map a single page. Normal loads don't tend to have those kinds of
> "minimal range is very different from whole range" cases. Do you have
> a real case for why it does that minimal range thing.

I was certainly seeing this issue trigger regularly when running firefox,
but I'll need to dig and find out the differences in range size.

> Because if you don't have a real case, I'd really suggest you get rid
> of all the arm64 games with start/end. That still leaves this one
> patch the correct thing to do (because even without the start/end
> games the "need_flush" flag goes with the range, but it makes the
> second patch a complete non-issue.

Since we have hardware broadcasting of TLB invalidations on ARM, it is
in our interest to keep the number of outstanding operations as small as
possible, particularly on large systems where we don't get the targeted
shootdown with a single message that you can perform using IPIs (i.e.
you can only broadcast to all or no CPUs, and that happens for each pte).

Now, what I'd actually like to do is to keep track of discrete ranges,
so that we can avoid sending a TLB invalidation for each PAGE_SIZE
region of a huge-page mapping. That information is lost with our current
minimal range scheme and it was whilst I was thinking about this that I
noticed we were getting passed negative ranges with the current
implementation.

> If you *do* have a real case, I think you need to modify your second patch to:
>
> - move the arm start/end updates from tlb_flush/tlb_add_flush to asm-generic
>
> - make tlb_gather_mmu() initialize start/end to TASK_SIZE/0 the same
> way your tlb_flush does (so that the subsequent min/max games work).
>
> so that *everybody* does the start/end games. I'd be ok with that.

I'll try and come up with something which addresses the above, and we
can go from there.

> What I'm not ok with is arm64 using the generic TLB gather way in odd
> ways that then breaks code and results in things like your 2/2 patch
> that fixes ARM64 but breaks x86.

Understood, that certainly wasn't my intention.

Cheers,

Will
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-10-28 16:07 ` Will Deacon
@ 2014-10-28 16:25 ` Linus Torvalds
2014-10-28 17:07 ` Will Deacon
2014-10-28 21:16 ` Benjamin Herrenschmidt
0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2014-10-28 16:25 UTC (permalink / raw)
To: Will Deacon
Cc: Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Oct 28, 2014 at 9:07 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> Ok, that's useful, thanks. Out of curiosity, what *is* the current intention
> of __tlb_remove_tlb_entry, if start/end shouldn't be touched by
> architectures? Is it just for the PPC hash thing?

I think it's both the PPC hash, and for "legacy reasons" (ie
architectures that don't use the generic code, and were converted from
the "invalidate as you walk the tables" without ever really fixing the
"you have to flush the TLB before you free the page, and do
batching").

It would be lovely if we could just drop it entirely, although
changing it to actively do the minimal range is fine too.

> I was certainly seeing this issue trigger regularly when running firefox,
> but I'll need to dig and find out the differences in range size.

I'm wondering whether that was perhaps because of the mix-up with
initialization of the range. Afaik, that would always break your
min/max thing for the first batch (and since the batches are fairly
large, "first" may be "only")

But hey, it's possible that firefox does some big mappings but only
populates the beginning. Most architectures don't tend to have
excessive glass jaws in this area: invalidating things page-by-page is
invariably so slow that at some point you just go "just do the whole
range".

> Since we have hardware broadcasting of TLB invalidations on ARM, it is
> in our interest to keep the number of outstanding operations as small as
> possible, particularly on large systems where we don't get the targeted
> shootdown with a single message that you can perform using IPIs (i.e.
> you can only broadcast to all or no CPUs, and that happens for each pte).

Do you seriously *have* to broadcast for each pte?

Because that is quite frankly moronic. We batch things up in software
for a real good reason: doing things one entry at a time just cannot
ever scale. At some point (and that point is usually not even very far
away), it's much better to do a single invalidate over a range. The
cost of having to refill the TLB's is *much* smaller than the cost of
doing tons of cross-CPU invalidates.

That's true even for the cases where we track the CPU's involved in
that mapping, and only invalidate a small subset. With a "all CPU's
broadcast", the cross-over point must be even smaller. Doing thousands
of CPU broadcasts is just crazy, even if they are hw-accelerated.

Can't you just do a full invalidate and a SW IPI for larger ranges?

And as mentioned, true sparse mappings are actually fairly rare, so
making extra effort (and data structures) to have individual ranges
sounds crazy.

Is this some hw-enforced thing? You really can't turn off the
cross-cpu-for-each-pte braindamage?

Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-10-28 16:25 ` Linus Torvalds
@ 2014-10-28 17:07 ` Will Deacon
2014-10-28 18:03 ` Linus Torvalds
2014-10-28 21:16 ` Benjamin Herrenschmidt
1 sibling, 1 reply; 27+ messages in thread
From: Will Deacon @ 2014-10-28 17:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Oct 28, 2014 at 04:25:35PM +0000, Linus Torvalds wrote:
> On Tue, Oct 28, 2014 at 9:07 AM, Will Deacon <will.deacon@arm.com> wrote:
> > I was certainly seeing this issue trigger regularly when running firefox,
> > but I'll need to dig and find out the differences in range size.
>
> I'm wondering whether that was perhaps because of the mix-up with
> initialization of the range. Afaik, that would always break your
> min/max thing for the first batch (and since the batches are fairly
> large, "first" may be "only")
>
> But hey. it's possible that firefox does some big mappings but only
> populates the beginning. Most architectures don't tend to have
> excessive glass jaws in this area: invalidating things page-by-page is
> invariably so slow that at some point you just go "just do the whole
> range".
>
> > Since we have hardware broadcasting of TLB invalidations on ARM, it is
> > in our interest to keep the number of outstanding operations as small as
> > possible, particularly on large systems where we don't get the targeted
> > shootdown with a single message that you can perform using IPIs (i.e.
> > you can only broadcast to all or no CPUs, and that happens for each pte).
>
> Do you seriously *have* to broadcast for each pte?
>
> Because that is quite frankly moronic. We batch things up in software
> for a real good reason: doing things one entry at a time just cannot
> ever scale. At some point (and that point is usually not even very far
> away), it's much better to do a single invalidate over a range. The
> cost of having to refill the TLB's is *much* smaller than the cost of
> doing tons of cross-CPU invalidates.

I don't think that's necessarily true, at least not on the systems I'm
familiar with. A table walk can be comparatively expensive, particularly
when virtualisation is involved and the depth of the host and guest page
tables starts to grow -- we're talking >20 memory accesses per walk. By
contrast, the TLB invalidation messages are asynchronous and carried on
the interconnect (a DSB instruction is used to synchronise the updates).

> That's true even for the cases where we track the CPU's involved in
> that mapping, and only invalidate a small subset. With a "all CPU's
> broadcast", the cross-over point must be even smaller. Doing thousands
> of CPU broadcasts is just crazy, even if they are hw-accelerated.
>
> Can't you just do a full invalidate and a SW IPI for larger ranges?

We already do that, but it's mainly there to catch *really* large ranges
(like the negative ones...), which can trigger the soft lockup detector.
The cases we've seen for this so far have been bugs (e.g. this thread and
also a related issue where we try to flush the whole of vmalloc space).

> And as mentioned, true sparse mappings are actually fairly rare, so
> making extra effort (and data structures) to have individual ranges
> sounds crazy.

Sure, I'll try and get some data on this. I'd like to resolve the THP case,
at least, which means keeping track of calls to __tlb_remove_pmd_tlb_entry.

> Is this some hw-enforced thing? You really can't turn off the
> cross-cpu-for-each-pte braindamage?

We could use IPIs if we wanted to and issue local TLB invalidations on
the targeted cores, but I'd be surprised if this showed an improvement
on ARM-based systems.

Will
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-10-28 17:07 ` Will Deacon
@ 2014-10-28 18:03 ` Linus Torvalds
0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2014-10-28 18:03 UTC (permalink / raw)
To: Will Deacon
Cc: Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Oct 28, 2014 at 10:07 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> I don't think that's necessarily true, at least not on the systems I'm
> familiar with. A table walk can be comparatively expensive, particularly
> when virtualisation is involved and the depth of the host and guest page
> tables starts to grow -- we're talking >20 memory accesses per walk. By
> contrast, the TLB invalidation messages are asynchronous and carried on
> the interconnect (a DSB instruction is used to synchronise the updates).

">20 memory accesses per *walk*"? Isn't the ARM a regular table? So
once you've gone down to the pte level, it's just an array, regardless
of how many levels there are.

But I guess there are no actual multi-socket ARM's around in real
life, so you probably don't see the real scaling costs. Within a die,
you're probably right that the overhead is negligible.

Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-10-28 16:25 ` Linus Torvalds
2014-10-28 17:07 ` Will Deacon
@ 2014-10-28 21:16 ` Benjamin Herrenschmidt
2014-10-28 21:32 ` Linus Torvalds
1 sibling, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2014-10-28 21:16 UTC (permalink / raw)
To: Linus Torvalds
Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux

On Tue, 2014-10-28 at 09:25 -0700, Linus Torvalds wrote:
> > Since we have hardware broadcasting of TLB invalidations on ARM, it is
> > in our interest to keep the number of outstanding operations as small as
> > possible, particularly on large systems where we don't get the targeted
> > shootdown with a single message that you can perform using IPIs (i.e.
> > you can only broadcast to all or no CPUs, and that happens for each pte).
>
> Do you seriously *have* to broadcast for each pte?

We do too, in current CPUs at least, it's sad ...

> Because that is quite frankly moronic. We batch things up in software
> for a real good reason: doing things one entry at a time just cannot
> ever scale. At some point (and that point is usually not even very far
> away), it's much better to do a single invalidate over a range. The
> cost of having to refill the TLB's is *much* smaller than the cost of
> doing tons of cross-CPU invalidates.
>
> That's true even for the cases where we track the CPU's involved in
> that mapping, and only invalidate a small subset. With a "all CPU's
> broadcast", the cross-over point must be even smaller. Doing thousands
> of CPU broadcasts is just crazy, even if they are hw-accelerated.
>
> Can't you just do a full invalidate and a SW IPI for larger ranges?

For us, this would be great except ... we can potentially have other
agents with an MMU that only support snooping of the broadcasts...

> And as mentioned, true sparse mappings are actually fairly rare, so
> making extra effort (and data structures) to have individual ranges
> sounds crazy.
>
> Is this some hw-enforced thing? You really can't turn off the
> cross-cpu-for-each-pte braindamage?

Cheers,
Ben.

> Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-10-28 21:16 ` Benjamin Herrenschmidt
@ 2014-10-28 21:32 ` Linus Torvalds
0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2014-10-28 21:32 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux

On Tue, Oct 28, 2014 at 2:16 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>>
>> Can't you just do a full invalidate and a SW IPI for larger ranges?
>
> For us, this would be great except ... we can potentially have other
> agents with an MMU that only support snooping of the broadcasts...

Ugh. Oh well. I guess on power you need to walk all the hashed entries
individually _anyway_, so there's no way you could really use a
range-based or "invalidate all" model to avoid some of the work.

Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-10-28 15:30 ` Linus Torvalds
2014-10-28 16:07 ` Will Deacon
@ 2014-10-28 21:40 ` Linus Torvalds
2014-10-29 19:47 ` Will Deacon
1 sibling, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2014-10-28 21:40 UTC (permalink / raw)
To: Will Deacon
Cc: Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Oct 28, 2014 at 8:30 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote:
>>
>> This patch fixes the problem by incrementing addr by the PAGE_SIZE
>> before breaking out of the loop on batch failure.
>
> This patch seems harmless and right [..]

I've applied it (commit ce9ec37bddb6), and marked it for stable.

I think that bug has been around since at least commit 2b047252d087
("Fix TLB gather virtual address range invalidation corner cases")
which went into 3.11, but that in turn was also marked for stable, so
I'm not sure just how far back this fix needs to go. I suspect the
simple answer is "as far back as it applies" ;)

I'll wait and see what you'll do about the other patch.

Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure 2014-10-28 21:40 ` Linus Torvalds @ 2014-10-29 19:47 ` Will Deacon 2014-10-29 21:11 ` Linus Torvalds 2014-11-04 14:29 ` Catalin Marinas 0 siblings, 2 replies; 27+ messages in thread From: Will Deacon @ 2014-10-29 19:47 UTC (permalink / raw) To: Linus Torvalds Cc: Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt On Tue, Oct 28, 2014 at 09:40:35PM +0000, Linus Torvalds wrote: > On Tue, Oct 28, 2014 at 8:30 AM, Linus Torvalds > <torvalds@linux-foundation.org> wrote: > > On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote: > >> > >> This patch fixes the problem by incrementing addr by the PAGE_SIZE > >> before breaking out of the loop on batch failure. > > > > This patch seems harmless and right [..] > > I've applied it (commit ce9ec37bddb6), and marked it for stable. > > I think that bug has been around since at least commit 2b047252d087 > ("Fix TLB gather virtual address range invalidation corner cases") > which went into 3.11, but that has in turn then was also marked for > stable, so I'm not sure just how far back this fix needs to go. I > suspect the simple answer is "as far back as it applies" ;) Thanks for that. > I'll wait and see what you'll do about the other patch. I've cooked up something (see below), but it unfortunately makes a couple of minor changes to powerpc and microblaze to address redefinitions of some of the gather callbacks (tlb{start,end}vma, __tlb_remove_tlb_entry). On the plus side, it tidies up the force_flush case in zap_pte_range quite nicely (assuming I didn't screw it up again). 
Cheers, Will --->8 commit f51dd616639dfbfe0685c82b47e0f31e4a34f16b Author: Will Deacon <will.deacon@arm.com> Date: Wed Oct 29 10:03:09 2014 +0000 mmu_gather: move minimal range calculations into generic code On architectures with hardware broadcasting of TLB invalidation messages , it makes sense to reduce the range of the mmu_gather structure when unmapping page ranges based on the dirty address information passed to tlb_remove_tlb_entry. arm64 already does this by directly manipulating the start/end fields of the gather structure, but this confuses the generic code which does not expect these fields to change and can end up calculating invalid, negative ranges when forcing a flush in zap_pte_range. This patch moves the minimal range calculation out of the arm64 code and into the generic implementation, simplifying zap_pte_range in the process (which no longer needs to care about start/end, since they will point to the appropriate ranges already). Signed-off-by: Will Deacon <will.deacon@arm.com> diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h index a82c0c5c8b52..a9c9df0f60ff 100644 --- a/arch/arm64/include/asm/tlb.h +++ b/arch/arm64/include/asm/tlb.h @@ -19,10 +19,6 @@ #ifndef __ASM_TLB_H #define __ASM_TLB_H -#define __tlb_remove_pmd_tlb_entry __tlb_remove_pmd_tlb_entry - -#include <asm-generic/tlb.h> - #include <linux/pagemap.h> #include <linux/swap.h> @@ -37,16 +33,8 @@ static inline void __tlb_remove_table(void *_table) #define tlb_remove_entry(tlb, entry) tlb_remove_page(tlb, entry) #endif /* CONFIG_HAVE_RCU_TABLE_FREE */ -/* - * There's three ways the TLB shootdown code is used: - * 1. Unmapping a range of vmas. See zap_page_range(), unmap_region(). - * tlb->fullmm = 0, and tlb_start_vma/tlb_end_vma will be called. - * 2. Unmapping all vmas. See exit_mmap(). - * tlb->fullmm = 1, and tlb_start_vma/tlb_end_vma will be called. - * Page tables will be freed. - * 3. Unmapping argument pages. See shift_arg_pages(). 
- * tlb->fullmm = 0, but tlb_start_vma/tlb_end_vma will not be called. - */ +#include <asm-generic/tlb.h> + static inline void tlb_flush(struct mmu_gather *tlb) { if (tlb->fullmm) { @@ -54,54 +42,13 @@ static inline void tlb_flush(struct mmu_gather *tlb) } else if (tlb->end > 0) { struct vm_area_struct vma = { .vm_mm = tlb->mm, }; flush_tlb_range(&vma, tlb->start, tlb->end); - tlb->start = TASK_SIZE; - tlb->end = 0; - } -} - -static inline void tlb_add_flush(struct mmu_gather *tlb, unsigned long addr) -{ - if (!tlb->fullmm) { - tlb->start = min(tlb->start, addr); - tlb->end = max(tlb->end, addr + PAGE_SIZE); - } -} - -/* - * Memorize the range for the TLB flush. - */ -static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep, - unsigned long addr) -{ - tlb_add_flush(tlb, addr); -} - -/* - * In the case of tlb vma handling, we can optimise these away in the - * case where we're doing a full MM flush. When we're doing a munmap, - * the vmas are adjusted to only cover the region to be torn down. 
- */ -static inline void tlb_start_vma(struct mmu_gather *tlb, - struct vm_area_struct *vma) -{ - if (!tlb->fullmm) { - tlb->start = TASK_SIZE; - tlb->end = 0; } } -static inline void tlb_end_vma(struct mmu_gather *tlb, - struct vm_area_struct *vma) -{ - if (!tlb->fullmm) - tlb_flush(tlb); -} - static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, unsigned long addr) { pgtable_page_dtor(pte); - tlb_add_flush(tlb, addr); tlb_remove_entry(tlb, pte); } @@ -109,7 +56,6 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp, unsigned long addr) { - tlb_add_flush(tlb, addr); tlb_remove_entry(tlb, virt_to_page(pmdp)); } #endif @@ -118,15 +64,8 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp, static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp, unsigned long addr) { - tlb_add_flush(tlb, addr); tlb_remove_entry(tlb, virt_to_page(pudp)); } #endif -static inline void __tlb_remove_pmd_tlb_entry(struct mmu_gather *tlb, pmd_t *pmdp, - unsigned long address) -{ - tlb_add_flush(tlb, address); -} - #endif diff --git a/arch/microblaze/include/asm/tlb.h b/arch/microblaze/include/asm/tlb.h index 8aa97817cc8c..99b6ded54849 100644 --- a/arch/microblaze/include/asm/tlb.h +++ b/arch/microblaze/include/asm/tlb.h @@ -14,7 +14,6 @@ #define tlb_flush(tlb) flush_tlb_mm((tlb)->mm) #include <linux/pagemap.h> -#include <asm-generic/tlb.h> #ifdef CONFIG_MMU #define tlb_start_vma(tlb, vma) do { } while (0) @@ -22,4 +21,6 @@ #define __tlb_remove_tlb_entry(tlb, pte, address) do { } while (0) #endif +#include <asm-generic/tlb.h> + #endif /* _ASM_MICROBLAZE_TLB_H */ diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h index e9a9f60e596d..fc3ee06eab87 100644 --- a/arch/powerpc/include/asm/pgalloc.h +++ b/arch/powerpc/include/asm/pgalloc.h @@ -3,7 +3,6 @@ #ifdef __KERNEL__ #include <linux/mm.h> -#include 
<asm-generic/tlb.h> #ifdef CONFIG_PPC_BOOK3E extern void tlb_flush_pgtable(struct mmu_gather *tlb, unsigned long address); @@ -14,6 +13,8 @@ static inline void tlb_flush_pgtable(struct mmu_gather *tlb, } #endif /* !CONFIG_PPC_BOOK3E */ +extern void tlb_remove_table(struct mmu_gather *tlb, void *table); + #ifdef CONFIG_PPC64 #include <asm/pgalloc-64.h> #else diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h index e2b428b0f7ba..20733fa518ae 100644 --- a/arch/powerpc/include/asm/tlb.h +++ b/arch/powerpc/include/asm/tlb.h @@ -27,6 +27,7 @@ #define tlb_start_vma(tlb, vma) do { } while (0) #define tlb_end_vma(tlb, vma) do { } while (0) +#define __tlb_remove_tlb_entry __tlb_remove_tlb_entry extern void tlb_flush(struct mmu_gather *tlb); diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h index 5672d7ea1fa0..340bc5c5ca2d 100644 --- a/include/asm-generic/tlb.h +++ b/include/asm-generic/tlb.h @@ -128,6 +128,46 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page) tlb_flush_mmu(tlb); } +static inline void __tlb_adjust_range(struct mmu_gather *tlb, + unsigned long address) +{ + if (!tlb->fullmm) { + tlb->start = min(tlb->start, address); + tlb->end = max(tlb->end, address + PAGE_SIZE); + } +} + +static inline void __tlb_reset_range(struct mmu_gather *tlb) +{ + tlb->start = TASK_SIZE; + tlb->end = 0; +} + +/* + * In the case of tlb vma handling, we can optimise these away in the + * case where we're doing a full MM flush. When we're doing a munmap, + * the vmas are adjusted to only cover the region to be torn down. 
+ */ +#ifndef tlb_start_vma +#define tlb_start_vma(tlb, vma) do { } while (0) +#endif + +#define __tlb_end_vma(tlb, vma) \ + do { \ + if (!tlb->fullmm) { \ + tlb_flush(tlb); \ + __tlb_reset_range(tlb); \ + } \ + } while (0) + +#ifndef tlb_end_vma +#define tlb_end_vma __tlb_end_vma +#endif + +#ifndef __tlb_remove_tlb_entry +#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0) +#endif + /** * tlb_remove_tlb_entry - remember a pte unmapping for later tlb invalidation. * @@ -138,6 +178,7 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page) #define tlb_remove_tlb_entry(tlb, ptep, address) \ do { \ tlb->need_flush = 1; \ + __tlb_adjust_range(tlb, address); \ __tlb_remove_tlb_entry(tlb, ptep, address); \ } while (0) @@ -152,12 +193,14 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page) #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address) \ do { \ tlb->need_flush = 1; \ + __tlb_adjust_range(tlb, address); \ __tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \ } while (0) #define pte_free_tlb(tlb, ptep, address) \ do { \ tlb->need_flush = 1; \ + __tlb_adjust_range(tlb, address); \ __pte_free_tlb(tlb, ptep, address); \ } while (0) @@ -165,6 +208,7 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page) #define pud_free_tlb(tlb, pudp, address) \ do { \ tlb->need_flush = 1; \ + __tlb_adjust_range(tlb, address); \ __pud_free_tlb(tlb, pudp, address); \ } while (0) #endif @@ -172,6 +216,7 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page) #define pmd_free_tlb(tlb, pmdp, address) \ do { \ tlb->need_flush = 1; \ + __tlb_adjust_range(tlb, address); \ __pmd_free_tlb(tlb, pmdp, address); \ } while (0) diff --git a/mm/memory.c b/mm/memory.c index 3e503831e042..0bc940e41ec9 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -220,8 +220,6 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long /* Is it from 0 to ~0? 
*/ tlb->fullmm = !(start | (end+1)); tlb->need_flush_all = 0; - tlb->start = start; - tlb->end = end; tlb->need_flush = 0; tlb->local.next = NULL; tlb->local.nr = 0; @@ -232,6 +230,8 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long #ifdef CONFIG_HAVE_RCU_TABLE_FREE tlb->batch = NULL; #endif + + __tlb_reset_range(tlb); } static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) @@ -241,6 +241,7 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb) #ifdef CONFIG_HAVE_RCU_TABLE_FREE tlb_table_flush(tlb); #endif + __tlb_reset_range(tlb); } static void tlb_flush_mmu_free(struct mmu_gather *tlb) @@ -1186,20 +1187,8 @@ again: arch_leave_lazy_mmu_mode(); /* Do the actual TLB flush before dropping ptl */ - if (force_flush) { - unsigned long old_end; - - /* - * Flush the TLB just for the previous segment, - * then update the range to be the remaining - * TLB range. - */ - old_end = tlb->end; - tlb->end = addr; + if (force_flush) tlb_flush_mmu_tlbonly(tlb); - tlb->start = addr; - tlb->end = old_end; - } pte_unmap_unlock(start_pte, ptl); /* ^ permalink raw reply related [flat|nested] 27+ messages in thread
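As a rough user-space model of the range tracking that the patch above moves into include/asm-generic/tlb.h, the sketch below may help. It is not kernel code: the `model_` names, the `MODEL_TASK_SIZE` value and the `last_*` / `nr_flushes` bookkeeping fields are assumptions made for illustration, and the `need_flush` check that the real tlb_flush_mmu_tlbonly() performs is left out. Each removed entry widens [start, end), and a TLB-only flush resets the range, so entries gathered afterwards start a fresh one. That is why zap_pte_range no longer has to patch start/end by hand in the force_flush case.

```c
#define MODEL_PAGE_SIZE 4096UL
#define MODEL_TASK_SIZE 0xC0000000UL	/* hypothetical stand-in for TASK_SIZE */

/* Minimal stand-in for the range-related fields of struct mmu_gather. */
struct model_gather {
	unsigned long start, end;
	int fullmm;
	/* Bookkeeping for this sketch only: what the last flush covered. */
	unsigned long last_start, last_end;
	unsigned int nr_flushes;
};

/* Mirrors __tlb_reset_range(): an "empty" range has start > end. */
static void model_reset_range(struct model_gather *tlb)
{
	tlb->start = MODEL_TASK_SIZE;
	tlb->end = 0;
}

/* Mirrors __tlb_adjust_range() as posted: grow the range by one page. */
static void model_adjust_range(struct model_gather *tlb, unsigned long address)
{
	if (!tlb->fullmm) {
		if (address < tlb->start)
			tlb->start = address;
		if (address + MODEL_PAGE_SIZE > tlb->end)
			tlb->end = address + MODEL_PAGE_SIZE;
	}
}

/* Models the TLB-only flush: record the range, then reset it. */
static void model_flush_tlbonly(struct model_gather *tlb)
{
	tlb->last_start = tlb->start;
	tlb->last_end = tlb->end;
	tlb->nr_flushes++;
	model_reset_range(tlb);
}
```

In this model, a forced flush halfway through a range needs no special handling: the pages seen so far are flushed, and the next call to model_adjust_range() begins a new range at the next address.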
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure 2014-10-29 19:47 ` Will Deacon @ 2014-10-29 21:11 ` Linus Torvalds 2014-10-29 21:27 ` Benjamin Herrenschmidt 2014-11-04 14:29 ` Catalin Marinas 1 sibling, 1 reply; 27+ messages in thread From: Linus Torvalds @ 2014-10-29 21:11 UTC (permalink / raw) To: Will Deacon Cc: Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt On Wed, Oct 29, 2014 at 12:47 PM, Will Deacon <will.deacon@arm.com> wrote: > > I've cooked up something (see below), but it unfortunately makes a couple > of minor changes to powerpc and microblaze to address redefinitions of > some of the gather callbacks (tlb{start,end}vma, __tlb_remove_tlb_entry). > > On the plus side, it tidies up the force_flush case in zap_pte_range > quite nicely (assuming I didn't screw it up again). Yes, this looks fine to me. Looks like a good cleanup, and moves more code to generic headers rather than having arch-specific stuff. Ben, can you check that this is ok on powerpc? Who else should double-check this due to having been involved in tlb flushing? But I think this is good to go, modulo checking other architectures for sanity. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure 2014-10-29 21:11 ` Linus Torvalds @ 2014-10-29 21:27 ` Benjamin Herrenschmidt 2014-11-01 17:01 ` Linus Torvalds 0 siblings, 1 reply; 27+ messages in thread From: Benjamin Herrenschmidt @ 2014-10-29 21:27 UTC (permalink / raw) To: Linus Torvalds Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux On Wed, 2014-10-29 at 14:11 -0700, Linus Torvalds wrote: > On Wed, Oct 29, 2014 at 12:47 PM, Will Deacon <will.deacon@arm.com> wrote: > > > > I've cooked up something (see below), but it unfortunately makes a couple > > of minor changes to powerpc and microblaze to address redefinitions of > > some of the gather callbacks (tlb{start,end}vma, __tlb_remove_tlb_entry). > > > > On the plus side, it tidies up the force_flush case in zap_pte_range > > quite nicely (assuming I didn't screw it up again). > > Yes, this looks fine to me. Looks like a good cleanup, and moves more > code to generic headers rather than having arch-specific stuff. > > Ben, can you check that this is ok on powerpc? Who else should > double-check this due to having been involved in tlb flushing? But I > think this is good to go, modulo checking other architectures for > sanity. TLB flushing is only me I think, I'll engage my brain after breakfast and see if is all good. The difficulty for us is that SW loaded TLB, hash32 and hash64 are all very different and my brain's been good at swapping a lot of that stuff out lately... Cheers, Ben. ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-10-29 21:27 ` Benjamin Herrenschmidt
@ 2014-11-01 17:01 ` Linus Torvalds
2014-11-01 20:25 ` Benjamin Herrenschmidt
2014-11-03 17:56 ` Will Deacon
0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2014-11-01 17:01 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux

On Wed, Oct 29, 2014 at 2:27 PM, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
>
> TLB flushing is only me I think, I'll engage my brain after breakfast
> and see if is all good

Ping? Breakfast is either long over, or you're starting to look a bit like Mr Creosote...

Anyway, Will, I assume this is not a correctness issue for you, just an annoying performance issue. Right? Or is there actually some issue with the actual range not being set to be sufficiently large?

Also, it strikes me that I *think* that you might be able to extend your patch to remove the whole "need_flush" field, since as far as I can tell, "tlb->need_flush" is now equivalent to "tlb->start < tlb->end". Of course, as long as we still require that "need_flush_all", that doesn't actually save us any space, so maybe it's not worth changing.

Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure 2014-11-01 17:01 ` Linus Torvalds @ 2014-11-01 20:25 ` Benjamin Herrenschmidt 2014-11-03 17:56 ` Will Deacon 1 sibling, 0 replies; 27+ messages in thread From: Benjamin Herrenschmidt @ 2014-11-01 20:25 UTC (permalink / raw) To: Linus Torvalds Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux On Sat, 2014-11-01 at 10:01 -0700, Linus Torvalds wrote: > On Wed, Oct 29, 2014 at 2:27 PM, Benjamin Herrenschmidt > <benh@kernel.crashing.org> wrote: > > > > TLB flushing is only me I think, I'll engage my brain after breakfast > > and see if is all good > > Ping? Breakfast is either long over, of you're starting to look a bit > like Mr Creosote... Argh... dropped that ball. > Anyway, Will, I assume this is not a correctness issue for you, just > an annoying performance issue. Right? Or is there actually some issue > with the actual range not being set to be sufficiently large? It should be fine for us in term of correctness I think. We rely on the lazy mmu bits for batching/flushing on hash64, we use __tlb_remove_tlb_entry() for immediate flush on hash32 and the SW loaded TLB cases are pretty dumb here and should be generally unaffected. > Also, it strikes me that I *think* that you might be able to extend > your patch to remove the whole "need_flush" field, since as far as I > can tell, "tlb->need_flush" is now equivalent to "tlb->start < > tlb->end". Of course, as long as we still require that > "need_flush_all", that doesn't actually save us any space, so maybe > it's not worth changing. Cheers, Ben. > Linus > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure 2014-11-01 17:01 ` Linus Torvalds 2014-11-01 20:25 ` Benjamin Herrenschmidt @ 2014-11-03 17:56 ` Will Deacon 2014-11-03 18:05 ` Linus Torvalds 1 sibling, 1 reply; 27+ messages in thread From: Will Deacon @ 2014-11-03 17:56 UTC (permalink / raw) To: Linus Torvalds Cc: Benjamin Herrenschmidt, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux On Sat, Nov 01, 2014 at 05:01:30PM +0000, Linus Torvalds wrote: > On Wed, Oct 29, 2014 at 2:27 PM, Benjamin Herrenschmidt > <benh@kernel.crashing.org> wrote: > > > > TLB flushing is only me I think, I'll engage my brain after breakfast > > and see if is all good > > Ping? Breakfast is either long over, of you're starting to look a bit > like Mr Creosote... Wafer thin patch? > Anyway, Will, I assume this is not a correctness issue for you, just > an annoying performance issue. Right? Or is there actually some issue > with the actual range not being set to be sufficiently large? Yeah, it's just a performance issue. For ranges over 1k pages, we end up flushing the whole TLB. > Also, it strikes me that I *think* that you might be able to extend > your patch to remove the whole "need_flush" field, since as far as I > can tell, "tlb->need_flush" is now equivalent to "tlb->start < > tlb->end". Of course, as long as we still require that > "need_flush_all", that doesn't actually save us any space, so maybe > it's not worth changing. We use `tlb->end > 0' in the arm64 backend, so I think you're right. I'll take a look in the name of cleanup and this can wait until 3.19. Will ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure 2014-11-03 17:56 ` Will Deacon @ 2014-11-03 18:05 ` Linus Torvalds 0 siblings, 0 replies; 27+ messages in thread From: Linus Torvalds @ 2014-11-03 18:05 UTC (permalink / raw) To: Will Deacon Cc: Benjamin Herrenschmidt, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux On Mon, Nov 3, 2014 at 9:56 AM, Will Deacon <will.deacon@arm.com> wrote: > > We use `tlb->end > 0' in the arm64 backend, so I think you're right. I'll > take a look in the name of cleanup and this can wait until 3.19. Ok, I'll just archive this thread for now, and expect a patch at some future date. And as far as I'm concerned, it's ok if it just comes in through the normal arm64 tree, with just a note in the pull request reminding me about why it also touches some other architectures. Thanks, Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
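The suggestion discussed above, that "tlb->need_flush" could be derived from "tlb->start < tlb->end", can be sketched in user space as follows. This is a hypothetical model, not kernel code: the `model_` names and the `MODEL_TASK_SIZE` value are assumptions. One caveat is visible in the model: with __tlb_adjust_range() as posted, the range is not updated when fullmm is set, so the equivalence would only hold for the non-fullmm case unless the range handling changes.

```c
#include <stdbool.h>

#define MODEL_PAGE_SIZE 4096UL
#define MODEL_TASK_SIZE 0xC0000000UL	/* hypothetical stand-in for TASK_SIZE */

struct model_gather {
	unsigned long start, end;
	bool fullmm;
	bool need_flush;	/* the field the thread proposes to drop */
};

static void model_reset(struct model_gather *tlb)
{
	tlb->start = MODEL_TASK_SIZE;
	tlb->end = 0;
	tlb->need_flush = false;
}

/* Models tlb_remove_tlb_entry(): set need_flush and grow the range. */
static void model_note_entry(struct model_gather *tlb, unsigned long address)
{
	tlb->need_flush = true;
	if (!tlb->fullmm) {
		if (address < tlb->start)
			tlb->start = address;
		if (address + MODEL_PAGE_SIZE > tlb->end)
			tlb->end = address + MODEL_PAGE_SIZE;
	}
}

/* Candidate replacement for need_flush: is the gathered range non-empty? */
static bool model_range_pending(const struct model_gather *tlb)
{
	return tlb->start < tlb->end;
}
```

For a non-fullmm gather, model_range_pending() and need_flush agree after every step. For a fullmm gather they diverge, which is the case any real cleanup would have to handle.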
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure 2014-10-29 19:47 ` Will Deacon 2014-10-29 21:11 ` Linus Torvalds @ 2014-11-04 14:29 ` Catalin Marinas 2014-11-04 16:08 ` Linus Torvalds 1 sibling, 1 reply; 27+ messages in thread From: Catalin Marinas @ 2014-11-04 14:29 UTC (permalink / raw) To: Will Deacon Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt (catching up with email after holiday) On Wed, Oct 29, 2014 at 07:47:39PM +0000, Will Deacon wrote: > mmu_gather: move minimal range calculations into generic code > > On architectures with hardware broadcasting of TLB invalidation messages > , it makes sense to reduce the range of the mmu_gather structure when > unmapping page ranges based on the dirty address information passed to > tlb_remove_tlb_entry. > > arm64 already does this by directly manipulating the start/end fields > of the gather structure, but this confuses the generic code which > does not expect these fields to change and can end up calculating > invalid, negative ranges when forcing a flush in zap_pte_range. > > This patch moves the minimal range calculation out of the arm64 code > and into the generic implementation, simplifying zap_pte_range in the > process (which no longer needs to care about start/end, since they will > point to the appropriate ranges already). > > Signed-off-by: Will Deacon <will.deacon@arm.com> Nice to see this clean-up for arm64, however I have a question below. 
> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h > index 5672d7ea1fa0..340bc5c5ca2d 100644 > --- a/include/asm-generic/tlb.h > +++ b/include/asm-generic/tlb.h > @@ -128,6 +128,46 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page) > tlb_flush_mmu(tlb); > } > > +static inline void __tlb_adjust_range(struct mmu_gather *tlb, > + unsigned long address) > +{ > + if (!tlb->fullmm) { > + tlb->start = min(tlb->start, address); > + tlb->end = max(tlb->end, address + PAGE_SIZE); > + } > +} Here __tlb_adjust_range() assumes end to be (start + PAGE_SIZE) always. > @@ -152,12 +193,14 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page) > #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address) \ > do { \ > tlb->need_flush = 1; \ > + __tlb_adjust_range(tlb, address); \ > __tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \ > } while (0) [...] > #define pmd_free_tlb(tlb, pmdp, address) \ > do { \ > tlb->need_flush = 1; \ > + __tlb_adjust_range(tlb, address); \ > __pmd_free_tlb(tlb, pmdp, address); \ > } while (0) This would work on arm64 but is the PAGE_SIZE range enough for all architectures even when we flush a huge page or a pmd/pud table entry? The approach Peter Z took with his patches was to use pmd_addr_end(addr, TASK_SIZE) and change __tlb_adjust_range() to take start/end arguments: https://lkml.org/lkml/2011/3/7/302 -- Catalin ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-11-04 14:29 ` Catalin Marinas
@ 2014-11-04 16:08 ` Linus Torvalds
2014-11-06 13:57 ` Catalin Marinas
0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2014-11-04 16:08 UTC (permalink / raw)
To: Catalin Marinas
Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Nov 4, 2014 at 6:29 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> This would work on arm64 but is the PAGE_SIZE range enough for all
> architectures even when we flush a huge page or a pmd/pud table entry?

It pretty much had *better* be.

For things like page table caches (ie caching addresses "inside" the page tables, like x86 does), for legacy reasons, flushing an individual page had better flush the page table caches behind it. This is definitely how x86 works, for example. And if you have an architected non-legacy page table cache (which I'm not aware of anybody actually doing), you're going to have some architecturally explicit flushing for that, likely *separate* from a regular TLB entry flush, and thus you'd need more than just some range expansion..

And the logic is very similar for things like hugepages. Either a normal "TLB invalidate" instruction anywhere in the hugepage will invalidate the whole hugepage, or you would have special instructions or rules for invalidating hugepages and you'd need more than just some range expansion.

So in neither case does it make sense to expand the range, afaik. And it would hurt normal architectures. So if we ever find an architecture that would want something that odd, I think it is up to that architecture to do its own odd thing, not cause pain for others.

Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure 2014-11-04 16:08 ` Linus Torvalds @ 2014-11-06 13:57 ` Catalin Marinas 2014-11-06 17:53 ` Linus Torvalds 0 siblings, 1 reply; 27+ messages in thread From: Catalin Marinas @ 2014-11-06 13:57 UTC (permalink / raw) To: Linus Torvalds Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt On Tue, Nov 04, 2014 at 04:08:27PM +0000, Linus Torvalds wrote: > On Tue, Nov 4, 2014 at 6:29 AM, Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > This would work on arm64 but is the PAGE_SIZE range enough for all > > architectures even when we flush a huge page or a pmd/pud table entry? > > It pretty much had *better* be. Thanks for confirming. > For things like page tables caches (ie caching addresses "inside" the > page tables, like x86 does), for legacy reasons, flushing an > individual page had better flush the page table caches behind it. This > is definitely how x86 works, for example. And if you have an > architected non-legacy page table cache (which I'm not aware of > anybody actually doing), you're going to have some architecturally > explicit flushing for that, likely *separate* from a regular TLB entry > flush, and thus you'd need more than just some range expansion.. On arm64 we have two types of TLB invalidation instructions, the standard one which flushes a pte entry together with the corresponding upper level page table cache and a "leaf" operation only for the pte. We don't use the latter in Linux (yet) but in theory it's more efficient. Anyway, even without special "leaf" operations, it would be useful to make the distinction between unmap_vmas() and free_pgtables() with regards to the ranges tracked by mmu_gather. For the former, tlb_flush() needs to flush the range in PAGE_SIZE increments (assuming a mix of small and huge pages). For the latter, PMD_SIZE increments would be enough. 
With RCU_TABLE_FREE, I think checking tlb->local.next would do the trick but for x86 we can keep mmu_gather.need_flush only for pte clearing and remove need_flush = 1 from p*_free_tlb() functions. The arch specific tlb_flush() can take need_flush into account to change the range flushing increment or even ignore the second tlb_flush() triggered by tlb_finish_mmu() (after free_pgtables(), the ptes have been flushed via tlb_end_vma()). -- Catalin ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-11-06 13:57 ` Catalin Marinas
@ 2014-11-06 17:53 ` Linus Torvalds
2014-11-06 18:38 ` Catalin Marinas
0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2014-11-06 17:53 UTC (permalink / raw)
To: Catalin Marinas
Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt

On Thu, Nov 6, 2014 at 5:57 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> Anyway, even without special "leaf" operations, it would be useful to
> make the distinction between unmap_vmas() and free_pgtables() with
> regards to the ranges tracked by mmu_gather. For the former, tlb_flush()
> needs to flush the range in PAGE_SIZE increments (assuming a mix of
> small and huge pages). For the latter, PMD_SIZE increments would be
> enough.

Why would you *ever* care about the increments?

Quite frankly, I think even the PAGE_SIZE thing is just (a) stupid and (b) misleading.

It might actually be better to instead of

    tlb->end = max(tlb->end, address + PAGE_SIZE);

it might as well be a simpler

    tlb->end = max(tlb->end, address+1)

(Even the "+1" is kind of unnecessary, but it's there to distinguish the zero address from the initial condition).

The thing is, absolutely *no* architecture has a TLB flush that involves giving the start and end of the page you want to flush. None. Nada. Zilch. Trying to think of it as a "range of bytes I need to flush" is wrong. And it's misleading, and it makes people think in stupid ways like "I should change PAGE_SIZE to PMD_SIZE when I flush a PMD". That's *wrong*.

Every single TLB flush is about a variation of "flush the TLB entry that contains the mapping for this address". The whole "add page-size" is pointless, because the *flushing* is not about giving a page-sized range, it's about giving *one* address in that page.
Using "address+1" is actually equally descriptive of what the range is ("flush the tlb entry that maps this *byte*"), but is less amenable to the above kind of silly PMD_SIZE confusion.

Because the exact same thing that is true for a single page is true for a page table operation too. There is just a *single* page table directory entry cache, it's not like you have one page table directory entry cache for each page.

So what matters for the non-leaf operations is not size. Not AT ALL. It's a totally different operation, and you'd need not a different size, but a different flag entirely - the same way we already have a different flag for the "full_mm" case. And it's actually for exactly the same reason as "full_mm": you do the flush itself differently, it's not that the range is different. If it was just about the range, then "full_mm" would just initialize the range to the whole VM. But it *isn't* about the range. It's about the fact that a full-MM tear-down can fundamentally do different operations, knowing that there are no other threads using that VM any more.

So I really really object to playing games with PMD_SIZE or other idiocies, because it fundamentally mis-states the whole problem.

If ARM64 wants to make the "leaf vs non-leaf" TLB operation, then you need to add a new flag, and you just set that flag when you tear down a page table (as opposed to just a PTE). It doesn't affect the "range" at all in any way as far as I can tell.

> With RCU_TABLE_FREE, I think checking tlb->local.next would do the trick
> but for x86 we can keep mmu_gather.need_flush only for pte clearing
> and remove need_flush = 1 from p*_free_tlb() functions.

This is more confusion about what is going on.

I'd actually really really prefer to have the "need_flush = 1" for the page table tear-down case even for x86.
No, if you never removed any PTE at all, it is possible that it's not actually needed because an x86 CPU isn't supposed to cache non-present page table entries (so if you could tear down the page tables because there were no entries, there should be no TLB entries, and there *hopefully* should be no caches of mid-level page tables either that need a TLB invalidate). But in practice, I'd not take that chance. If you tear down a page table, you should flush the TLB in that range (and note how I say *in* that range - an invalidate anywhere in the range should be sufficient, not "over the whole range"!), because quite frankly, from an implementation standpoint, I really think it's the sane and safe thing to do. So I would suggest you think of the x86 invlpg instruction as your "non-leaf invalidate". The same way you'd want to do non-leaf invalidate whenever you tear down a page table, you'd do "invlpg" on x86. And no, we should *not* play games with "tlb->local.next". That just sounds completely and utterly insane. That's a hack, it's unclear, it's stupid, and it's connected to a totally irrelevant implementation detail, namely that random RCU freeing. Set a flag, for chrissake. Just say "when you free a pmd/pud/pgd, set tlb->need_flush_inner to let the flusher know" (probably in *addition* to "tlb->need_flush", just to maintain that rule). Make it explicit, and make it obvious, and don't play games. And don't make it be about range sizes. Because I really don't see how the exact end/start could ever be relevant. TLB's aren't byte-based, they are entry-based. Linus ^ permalink raw reply [flat|nested] 27+ messages in thread
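The "set a flag" approach argued for above can be illustrated with a small user-space sketch. It is hypothetical throughout: `need_flush_inner` is only the name floated in this message, not an existing mmu_gather field, and the `model_` helpers and the enum are assumptions for illustration. The point it tries to show is that the choice between a leaf-only and a non-leaf invalidate depends on a flag set when a page table is freed, and not on the size of the gathered range.

```c
#include <stdbool.h>

/* Which kind of invalidate the (hypothetical) arch flusher should issue. */
enum model_flush_kind {
	MODEL_FLUSH_NONE,	/* nothing was unmapped or freed */
	MODEL_FLUSH_LEAF,	/* only PTEs were cleared */
	MODEL_FLUSH_NON_LEAF,	/* a page table was freed as well */
};

struct model_gather {
	bool need_flush;
	bool need_flush_inner;	/* name suggested in the thread, not a real field */
};

/* Clearing a PTE only needs the ordinary flag. */
static void model_remove_pte(struct model_gather *tlb)
{
	tlb->need_flush = true;
}

/* Freeing a pmd/pud/pgd sets the extra flag in addition to need_flush. */
static void model_free_table(struct model_gather *tlb)
{
	tlb->need_flush = true;
	tlb->need_flush_inner = true;
}

/* The flusher decides from the flags alone; no range size is consulted. */
static enum model_flush_kind model_pick_flush(const struct model_gather *tlb)
{
	if (!tlb->need_flush)
		return MODEL_FLUSH_NONE;
	return tlb->need_flush_inner ? MODEL_FLUSH_NON_LEAF : MODEL_FLUSH_LEAF;
}
```

Keeping need_flush set in both helpers follows the "in *addition* to tlb->need_flush" remark, so code that only looks at need_flush keeps working.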
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure 2014-11-06 17:53 ` Linus Torvalds @ 2014-11-06 18:38 ` Catalin Marinas 2014-11-06 21:29 ` Linus Torvalds 0 siblings, 1 reply; 27+ messages in thread From: Catalin Marinas @ 2014-11-06 18:38 UTC (permalink / raw) To: Linus Torvalds Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt On Thu, Nov 06, 2014 at 05:53:58PM +0000, Linus Torvalds wrote: > On Thu, Nov 6, 2014 at 5:57 AM, Catalin Marinas <catalin.marinas@arm.com> wrote: > > > > Anyway, even without special "leaf" operations, it would be useful to > > make the distinction between unmap_vmas() and free_pgtables() with > > regards to the ranges tracked by mmu_gather. For the former, tlb_flush() > > needs to flush the range in PAGE_SIZE increments (assuming a mix of > > small and huge pages). For the latter, PMD_SIZE increments would be > > enough. > > Why woyuld you *ever* care about the increments? Sorry, I wasn't clear enough about the "increments" part. I agreed with not using end = start + PMD_SIZE/PAGE_SIZE from your previous email already. The flush_tlb_range() implementation on ARM(64) uses a loop that goes over the given range in PAGE_SIZE increments. This is fine and even optimal when we flush the PTEs. But it could be even faster when we go over the same range again and only need to flush the page table cache (PMD/PUD/PGD). A new flush_tlb_table_range() function could loop in PMD_SIZE increments. That's an arm64-only implementation of the TLB range flushing, I'm not suggesting the PAGE_SIZE/PMD_SIZE increments when setting mmu_gather.end at all. > Quite frankly, I think even the PAGE_SIZE thing is just (a) stupid and > (b) misleading. > > It might actually be better to instead of > > tlb->end = max(tlb->end, address + PAGE_SIZE); > > it might as well be a simpler > > tlb->end = max(tlb->end, address+1) I fully agree. 
One minor drawback is that the TLB invalidation instructions on ARM work on pfn and end >> PAGE_SHIFT would make it equal to start. It can be fixed in the arch code though.

> So what matters for the non-leaf operations is not size. Not AT ALL.
> It's a totally different operation, and you'd need not a different
> size, but a different flag entirely - the same way we already have a
> different flag for the "full_mm" case. And it's actually for exactly
> the same reason as "full_mm": you do the flush itself differently,
> it's not that the range is different. If it was just about the range,
> then "full_mm" would just initialize the range to the whole VM. But it
> *isn't* about the range. It's about the fact that a full-MM tear-down
> can fundamentally do different operations, knowing that there are no
> other threads using that VM any more.
>
> So I really really object to playing games with PMD_SIZE or other
> idiocies, because it fundamentally mis-states the whole problem.

That's not what I am suggesting (though I agree I wasn't clear). The use of PMD_SIZE steps in a new flush_tlb_table_range() loop is entirely an arch-specific optimisation. Only that the arch code doesn't currently know which tlb_flush() it should use because need_flush is set for both PTEs and page table tear down. We just need different flags here to be able to optimise the arch code further.

> If ARM64 wants to make the "lead vs non-leaf" TLB operation, then you
> need to add a new flag, and you just set that flag when you tear down
> a page table (as opposed to just a PTE).

Indeed. Actually we could use need_flush only for PTEs and ignore it for page table tear-down. With Will's patch, we can already check tlb->end for what need_flush is currently doing and use need_flush in an arch-specific way (and we could give it a new name as well).
> > With RCU_TABLE_FREE, I think checking tlb->local.next would do the trick
> > but for x86 we can keep mmu_gather.need_flush only for pte clearing
> > and remove need_flush = 1 from p*_free_tlb() functions.
>
> This is more confusion about what is going on.

Yes, and if we do this we may no longer understand the code in a few weeks' time.

> I'd actually really really prefer to have the "need_flush = 1" for the
> page table tear-down case even for x86. No, if you never removed any
> PTE at all, it is possible that it's not actually needed because an
> x86 CPU isn't supposed to cache non-present page table entries (so if
> you could tear down the page tables because there were no entries,
> there should be no TLB entries, and there *hopefully* should be no
> caches of mid-level page tables either that need a TLB invalidate).

On ARM, as long as an intermediate page table entry is valid, even though the full translation (PTE) is not, the CPU can go and cache it. What's worse (and we've hit it before) is that it may even end up reading something that looks like a valid PTE (if the PTE page has been freed before the TLB invalidation) and it will stick around as a full translation. So we need to flush the page table cache before freeing the page table pages.

> But in practice, I'd not take that chance. If you tear down a page
> table, you should flush the TLB in that range (and note how I say *in*
> that range - an invalidate anywhere in the range should be sufficient,
> not "over the whole range"!), because quite frankly, from an
> implementation standpoint, I really think it's the sane and safe thing
> to do.

A single TLB invalidation is indeed enough for a single page table page removed. If we queue multiple page table pages for freeing, we accumulate the range via pmd_free_tlb() etc. and we would eventually need multiple TLB invalidations, at most one every PMD_SIZE (that's when !fullmm).

> So I would suggest you think of the x86 invlpg instruction as your
> "non-leaf invalidate".
The same way you'd want to do non-leaf > invalidate whenever you tear down a page table, you'd do "invlpg" on > x86. I need to dig some more in the x86 code, I'm not familiar with it. We could do the page table cache invalidation non-lazily every time pmd_free_tlb() is called, though it's not as optimal as we need a heavy DSB barrier on ARM after each TLB invalidate. > And no, we should *not* play games with "tlb->local.next". That just > sounds completely and utterly insane. That's a hack, it's unclear, > it's stupid, and it's connected to a totally irrelevant implementation > detail, namely that random RCU freeing. > > Set a flag, for chrissake. Just say "when you free a pmd/pud/pgd, set > tlb->need_flush_inner to let the flusher know" (probably in *addition* > to "tlb->need_flush", just to maintain that rule). Make it explicit, > and make it obvious, and don't play games. I agree. -- Catalin ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-11-06 18:38 ` Catalin Marinas
@ 2014-11-06 21:29 ` Linus Torvalds
2014-11-07 16:50 ` Catalin Marinas
0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2014-11-06 21:29 UTC (permalink / raw)
To: Catalin Marinas
Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt

On Thu, Nov 6, 2014 at 10:38 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> On Thu, Nov 06, 2014 at 05:53:58PM +0000, Linus Torvalds wrote:
>
> Sorry, I wasn't clear enough about the "increments" part. I agreed with
> not using end = start + PMD_SIZE/PAGE_SIZE from your previous email
> already.

Ahh, I misunderstood. You're really just after the granularity of tlb flushes.

That's fine. That makes sense. In fact, how about adding "granularity" to the mmu_gather structure, and then doing:

 - in __tlb_reset_range(), setting it to ~0ul

 - add "granularity" to __tlb_adjust_range(), and make it do something like

        if (!tlb->fullmm) {
                tlb->granularity = min(tlb->granularity, granularity);
                tlb->start = min(tlb->start, address);
                tlb->end = max(tlb->end, address+1);
        }

and then the TLB flush logic would basically do

        address = tlb->start;
        do {
                flush(address);
                if (address + tlb->granularity < address)
                        break;
                address = address + tlb->granularity;
        } while (address < tlb->end);

or something like that.

Now, if you unmap mixed ranges of large-pages and regular pages, you'd still have that granularity of one page, but quite frankly, if you do that, you probably deserve it. The common case is almost certainly going to be just "unmap large pages" or "unmap normal pages".

And if it turns out that I'm completely wrong, and mixed granularities are common, maybe there could be some hack in the "tlb->granularity" calculations that just forces a TLB flush when the granularity changes.

Hmm?

Linus
^ permalink raw reply [flat|nested] 27+ messages in thread
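The granularity proposal in the message above can be tried out in user space with the following sketch. It is a model under stated assumptions, not kernel code: the `model_` names, the page and PMD sizes, the use of ~0UL in place of TASK_SIZE, and the output array that records "flushed" addresses are all illustrative, and the early return for an empty range is an addition that the quoted loop does not have.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL
#define MODEL_PMD_SIZE  (512UL * MODEL_PAGE_SIZE)	/* assumes 4K pages, 2M PMDs */

struct model_gather {
	unsigned long start, end, granularity;
	int fullmm;
};

static void model_reset_range(struct model_gather *tlb)
{
	tlb->start = ~0UL;		/* stand-in for TASK_SIZE */
	tlb->end = 0;
	tlb->granularity = ~0UL;	/* "no entries seen yet" */
}

/* Track the smallest granule seen and the [start, end) byte range. */
static void model_adjust_range(struct model_gather *tlb,
			       unsigned long address,
			       unsigned long granularity)
{
	assert(granularity != 0);
	if (!tlb->fullmm) {
		if (granularity < tlb->granularity)
			tlb->granularity = granularity;
		if (address < tlb->start)
			tlb->start = address;
		if (address + 1 > tlb->end)
			tlb->end = address + 1;
	}
}

/*
 * Walk the range one granule at a time, storing each address that
 * would be invalidated in out[] (up to max). Returns the number of
 * invalidations. The wrap-around check stops the walk at the top of
 * the address space.
 */
static size_t model_flush(const struct model_gather *tlb,
			  unsigned long *out, size_t max)
{
	size_t n = 0;
	unsigned long address;

	if (tlb->start >= tlb->end)	/* empty range: added guard */
		return 0;

	address = tlb->start;
	do {
		if (n < max)
			out[n] = address;
		n++;
		if (address + tlb->granularity < address)
			break;
		address += tlb->granularity;
	} while (address < tlb->end);

	return n;
}
```

With only small pages gathered, the walk issues one invalidate per page; with only PMD-sized entries it issues one per PMD_SIZE. A mixed range falls back to page granularity, which is the trade-off acknowledged in the message.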
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-11-06 21:29 ` Linus Torvalds
@ 2014-11-07 16:50 ` Catalin Marinas
2014-11-10 13:56 ` Will Deacon
0 siblings, 1 reply; 27+ messages in thread
From: Catalin Marinas @ 2014-11-07 16:50 UTC (permalink / raw)
To: Linus Torvalds
Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt

On Thu, Nov 06, 2014 at 09:29:54PM +0000, Linus Torvalds wrote:
> On Thu, Nov 6, 2014 at 10:38 AM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Thu, Nov 06, 2014 at 05:53:58PM +0000, Linus Torvalds wrote:
> >
> > Sorry, I wasn't clear enough about the "increments" part. I agreed with
> > not using end = start + PMD_SIZE/PAGE_SIZE from your previous email
> > already.
>
> Ahh, I misunderstood. You're really just after the granularity of tlb flushes.

Yes. The granularity would also help when tearing down page tables as
the granule would be PMD_SIZE.

> That's fine. That makes sense. In fact, how about adding "granularity"
> to the mmu_gather structure, and then doing:
>
>  - in __tlb_reset_range(), setting it to ~0ul
>
>  - add "granularity" to __tlb_adjust_range(), and make it do something like
>
>        if (!tlb->fullmm) {
>                tlb->granularity = min(tlb->granularity, granularity);
>                tlb->start = min(tlb->start, address);
>                tlb->end = max(tlb->end, address+1);
>        }
>
> and then the TLB flush logic would basically do
>
>        address = tlb->start;
>        do {
>                flush(address);
>                if (address + tlb->granularity < address)
>                        break;
>                address = address + tlb->granularity;
>        } while (address < tlb->end);
>
> or something like that.

Indeed. We'll come up with a patch after Will's clean-up.

> Now, if you unmap mixed ranges of large-pages and regular pages, you'd
> still have that granularity of one page, but quite frankly, if you do
> that, you probably deserve it. The common case is almost certainly
> going to be just "unmap large pages" or "unmap normal pages".

I think this could only happen with transparent huge pages that replaced
small pages in an anonymous mapping. I don't think munmap'ing them
happens very often.

Thanks.

--
Catalin
^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
2014-11-07 16:50 ` Catalin Marinas
@ 2014-11-10 13:56 ` Will Deacon
0 siblings, 0 replies; 27+ messages in thread
From: Will Deacon @ 2014-11-10 13:56 UTC (permalink / raw)
To: Catalin Marinas
Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt

On Fri, Nov 07, 2014 at 04:50:04PM +0000, Catalin Marinas wrote:
> On Thu, Nov 06, 2014 at 09:29:54PM +0000, Linus Torvalds wrote:
> > That's fine. That makes sense. In fact, how about adding "granularity"
> > to the mmu_gather structure, and then doing:
> >
> >  - in __tlb_reset_range(), setting it to ~0ul
> >
> >  - add "granularity" to __tlb_adjust_range(), and make it do something like
> >
> >        if (!tlb->fullmm) {
> >                tlb->granularity = min(tlb->granularity, granularity);
> >                tlb->start = min(tlb->start, address);
> >                tlb->end = max(tlb->end, address+1);
> >        }
> >
> > and then the TLB flush logic would basically do
> >
> >        address = tlb->start;
> >        do {
> >                flush(address);
> >                if (address + tlb->granularity < address)
> >                        break;
> >                address = address + tlb->granularity;
> >        } while (address < tlb->end);
> >
> > or something like that.
>
> Indeed. We'll come up with a patch after Will's clean-up.

My clean-up is the patch I sent previously, plus the removal of need_flush.

Incremental diff for the latter part below. We drop a set of need_flush
from tlb_remove_table, but I can't figure out why it was there in the
first place (need_flush was already set by pXd_free_tlb).

Will

--->8

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index a9c9df0f60ff..c028fe37456f 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -39,7 +39,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 {
 	if (tlb->fullmm) {
 		flush_tlb_mm(tlb->mm);
-	} else if (tlb->end > 0) {
+	} else {
 		struct vm_area_struct vma = { .vm_mm = tlb->mm, };
 		flush_tlb_range(&vma, tlb->start, tlb->end);
 	}
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 340bc5c5ca2d..08848050922e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -96,10 +96,9 @@ struct mmu_gather {
 #endif
 	unsigned long start;
 	unsigned long end;
-	unsigned int need_flush : 1, /* Did free PTEs */
 	/* we are in the middle of an operation to clear
 	 * a full mm and can make some optimizations */
-			fullmm : 1,
+	unsigned int fullmm : 1,
 	/* we have performed an operation which
 	 * requires a complete flush of the tlb */
 			need_flush_all : 1;
@@ -131,10 +130,8 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 static inline void __tlb_adjust_range(struct mmu_gather *tlb,
 				      unsigned long address)
 {
-	if (!tlb->fullmm) {
-		tlb->start = min(tlb->start, address);
-		tlb->end = max(tlb->end, address + PAGE_SIZE);
-	}
+	tlb->start = min(tlb->start, address);
+	tlb->end = max(tlb->end, address + PAGE_SIZE);
 }
 
 static inline void __tlb_reset_range(struct mmu_gather *tlb)
@@ -154,7 +151,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 
 #define __tlb_end_vma(tlb, vma) \
 	do { \
-		if (!tlb->fullmm) { \
+		if (!tlb->fullmm && tlb->end) { \
 			tlb_flush(tlb); \
 			__tlb_reset_range(tlb); \
 		} \
@@ -171,13 +168,12 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 /**
  * tlb_remove_tlb_entry - remember a pte unmapping for later tlb invalidation.
  *
- * Record the fact that pte's were really umapped in ->need_flush, so we can
- * later optimise away the tlb invalidate. This helps when userspace is
- * unmapping already-unmapped pages, which happens quite a lot.
+ * Record the fact that pte's were really unmapped by updating the range,
+ * so we can later optimise away the tlb invalidate. This helps when
+ * userspace is unmapping already-unmapped pages, which happens quite a lot.
  */
 #define tlb_remove_tlb_entry(tlb, ptep, address) \
 	do { \
-		tlb->need_flush = 1; \
 		__tlb_adjust_range(tlb, address); \
 		__tlb_remove_tlb_entry(tlb, ptep, address); \
 	} while (0)
@@ -192,14 +188,12 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address) \
 	do { \
-		tlb->need_flush = 1; \
 		__tlb_adjust_range(tlb, address); \
 		__tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \
 	} while (0)
 
 #define pte_free_tlb(tlb, ptep, address) \
 	do { \
-		tlb->need_flush = 1; \
 		__tlb_adjust_range(tlb, address); \
 		__pte_free_tlb(tlb, ptep, address); \
 	} while (0)
@@ -207,7 +201,6 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 #ifndef __ARCH_HAS_4LEVEL_HACK
 #define pud_free_tlb(tlb, pudp, address) \
 	do { \
-		tlb->need_flush = 1; \
 		__tlb_adjust_range(tlb, address); \
 		__pud_free_tlb(tlb, pudp, address); \
 	} while (0)
@@ -215,7 +208,6 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 
 #define pmd_free_tlb(tlb, pmdp, address) \
 	do { \
-		tlb->need_flush = 1; \
 		__tlb_adjust_range(tlb, address); \
 		__pmd_free_tlb(tlb, pmdp, address); \
 	} while (0)
diff --git a/mm/memory.c b/mm/memory.c
index 0bc940e41ec9..8b1c1d2e7c67 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -220,7 +220,6 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 	/* Is it from 0 to ~0? */
 	tlb->fullmm = !(start | (end+1));
 	tlb->need_flush_all = 0;
-	tlb->need_flush = 0;
 	tlb->local.next = NULL;
 	tlb->local.nr = 0;
 	tlb->local.max = ARRAY_SIZE(tlb->__pages);
@@ -236,7 +235,9 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 
 static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-	tlb->need_flush = 0;
+	if (!tlb->end)
+		return;
+
 	tlb_flush(tlb);
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
@@ -257,8 +258,6 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 
 void tlb_flush_mmu(struct mmu_gather *tlb)
 {
-	if (!tlb->need_flush)
-		return;
 	tlb_flush_mmu_tlbonly(tlb);
 	tlb_flush_mmu_free(tlb);
 }
@@ -293,7 +292,7 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
 	struct mmu_gather_batch *batch;
 
-	VM_BUG_ON(!tlb->need_flush);
+	VM_BUG_ON(!tlb->end);
 
 	batch = tlb->active;
 	batch->pages[batch->nr++] = page;
@@ -360,8 +359,6 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
 	struct mmu_table_batch **batch = &tlb->batch;
 
-	tlb->need_flush = 1;
-
 	/*
 	 * When there's less then two users of this mm there cannot be a
 	 * concurrent page-table walk.
^ permalink raw reply related [flat|nested] 27+ messages in thread
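The effect of the diff above is that the tracked range doubles as the "anything to flush?" indicator: a zero end means nothing was recorded. The userspace sketch below models that convention. All names are hypothetical; the reset values (start at the top of the address space, end at 0) and the reset after a flush are assumptions of the sketch, since __tlb_reset_range() is not shown in the diff.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_PAGE_SIZE 4096UL	/* assumed 4 KiB pages */

/* Hypothetical stand-in for the range fields of struct mmu_gather. */
struct range_sketch {
	unsigned long start;
	unsigned long end;	/* 0 means "nothing to flush" */
	unsigned int flushes;	/* counts stand-in TLB flushes */
};

/* Assumed reset values: an empty range has end == 0. */
static void range_reset(struct range_sketch *tlb)
{
	tlb->start = ULONG_MAX;
	tlb->end = 0;
}

/* Same arithmetic as __tlb_adjust_range() after the diff. */
static void range_adjust(struct range_sketch *tlb, unsigned long address)
{
	assert(address <= ULONG_MAX - SKETCH_PAGE_SIZE);
	if (address < tlb->start)
		tlb->start = address;
	if (address + SKETCH_PAGE_SIZE > tlb->end)
		tlb->end = address + SKETCH_PAGE_SIZE;
}

/* Mirrors the early return added to tlb_flush_mmu_tlbonly(); returns 1 if it flushed. */
static int range_flush_tlbonly(struct range_sketch *tlb)
{
	if (!tlb->end)
		return 0;	/* nothing recorded: skip the flush */

	tlb->flushes++;		/* stands in for tlb_flush(tlb) */
	range_reset(tlb);	/* assumption: the range is cleared after flushing */
	return 1;
}
```

Under these assumptions a second flush call right after the first is a no-op, which is the behaviour the separate need_flush bit used to provide.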
* [RFC PATCH 2/2] zap_pte_range: fix partial TLB flushing in response to a dirty pte
2014-10-28 11:44 [RFC PATCH 0/2] Fix a couple of issues with zap_pte_range and MMU gather Will Deacon
2014-10-28 11:44 ` [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure Will Deacon
@ 2014-10-28 11:44 ` Will Deacon
2014-10-28 15:18 ` Linus Torvalds
1 sibling, 1 reply; 27+ messages in thread
From: Will Deacon @ 2014-10-28 11:44 UTC (permalink / raw)
To: torvalds, peterz; +Cc: linux-kernel, linux, benh, Will Deacon

When we encounter a dirty page during unmap, we force a TLB invalidation
to avoid a race with pte_mkclean and stale, dirty TLB entries in the
CPU.

This uses the same force_flush logic as the batch failure code, but
since we don't break out of the loop when finding a dirty pte, tlb->end
can be < addr as we only batch for present ptes. This can result in a
negative range being passed to subsequent TLB invalidation calls,
potentially leading to massive over-invalidation of the TLB (observed
in practice running firefox on arm64).

This patch fixes the issue by restricting the use of addr in the TLB
range calculations. The first range then ends up covering tlb->start to
min(tlb->end, addr), which corresponds to the currently batched range.
The second range then covers anything remaining, which may still lead to
a (much reduced) over-invalidation of the TLB.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
mm/memory.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3e503831e042..ea41508d41f3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1194,11 +1194,10 @@ again:
 		 * then update the range to be the remaining
 		 * TLB range.
 		 */
-		old_end = tlb->end;
-		tlb->end = addr;
+		tlb->end = old_end = min(tlb->end, addr);
 		tlb_flush_mmu_tlbonly(tlb);
-		tlb->start = addr;
-		tlb->end = old_end;
+		tlb->start = old_end;
+		tlb->end = end;
 	}
 	pte_unmap_unlock(start_pte, ptl);
 
--
2.1.1
^ permalink raw reply related [flat|nested] 27+ messages in thread
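To make the "negative range" in the commit message concrete, the sketch below replays the arithmetic of both versions of the hunk in userspace. The struct and function names are hypothetical and the flush itself is omitted; only the start/end bookkeeping is modelled. It says nothing about whether `end` is the right upper bound for the second range, which is a separate question.

```c
#include <assert.h>

/* Hypothetical stand-in for the start/end fields of struct mmu_gather. */
struct span_sketch {
	unsigned long start;
	unsigned long end;
};

/*
 * Bookkeeping before the patch. Returns the range handed to the first
 * flush; *tlb is left holding the range kept for later.
 */
static struct span_sketch split_old(struct span_sketch *tlb, unsigned long addr)
{
	unsigned long old_end = tlb->end;
	struct span_sketch first;

	tlb->end = addr;
	first = *tlb;		/* tlb_flush_mmu_tlbonly(tlb) would run here */
	tlb->start = addr;
	tlb->end = old_end;	/* start > end whenever old_end < addr */
	return first;
}

/* Bookkeeping from the patch: clamp the first range to what was batched. */
static struct span_sketch split_patched(struct span_sketch *tlb,
					unsigned long addr, unsigned long end)
{
	unsigned long old_end;
	struct span_sketch first;

	assert(addr <= end);	/* addr walks towards end inside the pte loop */
	old_end = tlb->end < addr ? tlb->end : addr;
	tlb->end = old_end;
	first = *tlb;		/* tlb_flush_mmu_tlbonly(tlb) would run here */
	tlb->start = old_end;
	tlb->end = end;
	return first;
}
```

For example, with a batched range of 0x1000 to 0x3000 and addr at 0x5000, the old bookkeeping leaves start at 0x5000 and end at 0x3000, while the patched version flushes 0x1000 to 0x3000 first and keeps 0x3000 up to `end` for later.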
* Re: [RFC PATCH 2/2] zap_pte_range: fix partial TLB flushing in response to a dirty pte
2014-10-28 11:44 ` [RFC PATCH 2/2] zap_pte_range: fix partial TLB flushing in response to a dirty pte Will Deacon
@ 2014-10-28 15:18 ` Linus Torvalds
0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2014-10-28 15:18 UTC (permalink / raw)
To: Will Deacon
Cc: Peter Zijlstra, Linux Kernel Mailing List, Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote:
> @@ -1194,11 +1194,10 @@ again:
>  		 * then update the range to be the remaining
>  		 * TLB range.
>  		 */
> -		old_end = tlb->end;
> -		tlb->end = addr;
> +		tlb->end = old_end = min(tlb->end, addr);
>  		tlb_flush_mmu_tlbonly(tlb);
> -		tlb->start = addr;
> -		tlb->end = old_end;
> +		tlb->start = old_end;
> +		tlb->end = end;

I don't think this is right. Setting "tlb->end = end" looks very wrong
indeed, because "end" here inside zap_pte_range() is *not* the final
end of the zap range, it is just the end of the current set of pte's.

There's a reason the old code *saved* the old end value. You've now
ripped that out, and use the "old_end" for something else entirely.

Your arm64 version of tlb_add_flush() then hides the bug you just
introduced by updating the end range for each page you encounter. But
quite frankly, I think your problems are all fundamental to that very
issue. You're playing games with start/end during the TLB flush
itself, which is not how those things were designed to work.

So now you break everything that *doesn't* do your arm games.

Linus
^ permalink raw reply [flat|nested] 27+ messages in thread