[RFC PATCH 0/2] Fix a couple of issues with zap_pte_range and MMU gather

* [RFC PATCH 0/2] Fix a couple of issues with zap_pte_range and MMU gather
@ 2014-10-28 11:44 Will Deacon
  2014-10-28 11:44 ` [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure Will Deacon
  2014-10-28 11:44 ` [RFC PATCH 2/2] zap_pte_range: fix partial TLB flushing in response to a dirty pte Will Deacon
  0 siblings, 2 replies; 27+ messages in thread
From: Will Deacon @ 2014-10-28 11:44 UTC (permalink / raw)
  To: torvalds, peterz; +Cc: linux-kernel, linux, benh, Will Deacon

Hi all,

This patch series attempts to fix a couple of issues I've noticed with
zap_pte_range and the MMU gather code on arm64.

The first fix resolves a TLB range truncation, which I found by code
inspection (this is on the batch failure path, which doesn't appear to
be regularly exercised on my system).

For the second fix, I'd really appreciate some comments. The problem is
that the architecture TLB batching implementation may update the start
and end fields of the gather structure, so that they actually cover only
a subset of the initial range set up by tlb_gather_mmu (based on calls
to tlb_remove_tlb_entry). In the force_flush case, zap_pte_range sets
these fields directly, which can result in a negative range if the
architecture has also updated the end address. The patch here uses
min(end, addr) as the end of the first range, which creates a second
range from that address to the end of the region. This results in a
potential over-invalidation on arm64, but I can't think of anything
better without updating (at least) the x86 tlb.h implementation.

Ideally, we'd let the architecture set start/end during the call to
tlb_flush_mmu_tlbonly (arm64 does this already in tlb_flush).
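
As a rough illustration of the negative-range case described above, here is
a user-space toy model in C. It is only a sketch under stated assumptions:
struct toy_gather, the toy_* helpers and the TASK_SIZE value are hypothetical
stand-ins rather than the kernel's definitions, and the flush itself is
reduced to the range reset that an arm64-style implementation performs.

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE 4096UL
#define TASK_SIZE 0x40000000UL  /* hypothetical user address-space limit */

/* Toy stand-in for the start/end fields of the gather structure. */
struct toy_gather {
	unsigned long start;
	unsigned long end;
};

/* Arch-style reset after a flush: an empty range that min/max can grow. */
static inline void toy_reset(struct toy_gather *tlb)
{
	tlb->start = TASK_SIZE;
	tlb->end = 0;
}

/* Arch-style tracking: grow the range to cover the page at addr. */
static inline void toy_track_page(struct toy_gather *tlb, unsigned long addr)
{
	assert(addr < TASK_SIZE);
	if (addr < tlb->start)
		tlb->start = addr;
	if (addr + PAGE_SIZE > tlb->end)
		tlb->end = addr + PAGE_SIZE;
}

/*
 * Toy version of a caller that sets the fields directly: save end, flush
 * up to addr, then restore the range as [addr, old_end) without knowing
 * that the architecture code had narrowed end.
 */
static inline void toy_force_flush(struct toy_gather *tlb, unsigned long addr)
{
	unsigned long old_end = tlb->end;

	tlb->end = addr;
	toy_reset(tlb);		/* stands in for the flush, which resets the range */
	tlb->start = addr;
	tlb->end = old_end;
}

/* A "negative" range is one whose start lies above its end. */
static inline bool toy_range_is_inverted(const struct toy_gather *tlb)
{
	return tlb->start > tlb->end;
}
```

In this model, tracking a single page at 0x10000 and then forcing a flush at
addr 0x20000 leaves start above end, which is the inverted range in question.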

Thoughts?

Will

Will Deacon (2):
  zap_pte_range: update addr when forcing flush after TLB batching
    faiure
  zap_pte_range: fix partial TLB flushing in response to a dirty pte

 mm/memory.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

-- 
2.1.1

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-28 11:44 [RFC PATCH 0/2] Fix a couple of issues with zap_pte_range and MMU gather Will Deacon
@ 2014-10-28 11:44 ` Will Deacon
  2014-10-28 15:30   ` Linus Torvalds
  2014-10-28 11:44 ` [RFC PATCH 2/2] zap_pte_range: fix partial TLB flushing in response to a dirty pte Will Deacon
  1 sibling, 1 reply; 27+ messages in thread
From: Will Deacon @ 2014-10-28 11:44 UTC (permalink / raw)
  To: torvalds, peterz; +Cc: linux-kernel, linux, benh, Will Deacon

When unmapping a range of pages in zap_pte_range, the page being
unmapped is added to an mmu_gather_batch structure for asynchronous
freeing. If we run out of space in the batch structure before the range
has been completely unmapped, then we break out of the loop, force a
TLB flush and free the pages that we have batched so far. If there are
further pages to unmap, then we resume the loop where we left off.

Unfortunately, we forget to update addr when we break out of the loop,
which causes us to truncate the range being invalidated as the end
address is exclusive. When we re-enter the loop at the same address, the
page has already been freed and the pte_present test will fail, meaning
that we do not reconsider the address for invalidation.

This patch fixes the problem by incrementing addr by PAGE_SIZE
before breaking out of the loop on batch failure.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 mm/memory.c | 1 +
 1 file changed, 1 insertion(+)
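
For illustration only (not part of the patch): a user-space toy model of the
loop described in the commit message. It assumes a fixed batch capacity and
an eight-page range, both hypothetical, and counts pages that are freed
before any flushed [start, end) range has covered them. toy_zap() and its
bump_addr parameter are made-up names for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL
#define NPAGES 8		/* hypothetical size of the toy range, in pages */
#define BATCH_CAPACITY 3	/* hypothetical batch size */

/*
 * Returns the number of pages freed without a prior flush covering them.
 * bump_addr selects whether addr is advanced before breaking out of the
 * loop when the batch fills up.
 */
static inline size_t toy_zap(bool bump_addr)
{
	bool present[NPAGES], flushed[NPAGES];
	size_t batch[BATCH_CAPACITY];
	size_t batched = 0, freed_stale = 0;
	unsigned long addr = 0;
	const unsigned long end = NPAGES * PAGE_SIZE;

	for (size_t i = 0; i < NPAGES; i++) {
		present[i] = true;
		flushed[i] = false;
	}

	while (addr < end) {
		unsigned long flush_start = addr;

		for (; addr < end; addr += PAGE_SIZE) {
			size_t idx = addr / PAGE_SIZE;

			assert(idx < NPAGES);
			if (!present[idx])
				continue;	/* already unmapped on an earlier pass */
			present[idx] = false;	/* clear the toy pte */
			batch[batched++] = idx;	/* queue the page for freeing */
			if (batched == BATCH_CAPACITY) {
				if (bump_addr)
					addr += PAGE_SIZE;	/* the one-line fix */
				break;		/* batch full: force a flush */
			}
		}

		/* Flush [flush_start, addr); the end address is exclusive. */
		for (unsigned long a = flush_start; a < addr; a += PAGE_SIZE)
			flushed[a / PAGE_SIZE] = true;

		/* Free the batch, counting pages that no flush has covered yet. */
		for (size_t i = 0; i < batched; i++)
			if (!flushed[batch[i]])
				freed_stale++;
		batched = 0;
	}
	return freed_stale;
}
```

In this model, toy_zap(false) frees the last page of each full batch before
it has been flushed, while toy_zap(true) frees none early.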

diff --git a/mm/memory.c b/mm/memory.c
index 1cc6bfbd872e..3e503831e042 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1147,6 +1147,7 @@ again:
 				print_bad_pte(vma, addr, ptent, page);
 			if (unlikely(!__tlb_remove_page(tlb, page))) {
 				force_flush = 1;
+				addr += PAGE_SIZE;
 				break;
 			}
 			continue;
-- 
2.1.1

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-28 11:44 ` [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure Will Deacon
@ 2014-10-28 15:30   ` Linus Torvalds
  2014-10-28 16:07     ` Will Deacon
  2014-10-28 21:40     ` Linus Torvalds
  0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2014-10-28 15:30 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> This patch fixes the problem by incrementing addr by the PAGE_SIZE
> before breaking out of the loop on batch failure.

This patch seems harmless and right, unlike the other one.

I'd be ok with changing the *generic* code to do the "update start/end
pointers in the mmu_gather structure", but then it really has to be
done in *generic* code. Move your arm64 start/end updates to
include/asm-generic/tlb.h, and change the initialization of start/end
entirely. Right now we initialize those things to the maximal range
(see tlb_gather_mmu()), and the arm64 logic seems to be to initialize
them to TASK_SIZE/0 respectively and then update start/end as you add
pages, so that you get the minimal range.

But because of this arm64 confusion (the "minimal range" really is
*not* how this is designed to work), the current existing
tlb_gather_mmu() does the wrong initialization.

In other words: my argument is that right now the arm64 code is just
*wrong*. I'd be ok with making it the right thing to do, but if so, it
needs to be made generic.

Are there any actual advantages to the whole "minimal range" model? It
adds overhead and complexity, and the only case where it would seem to
be worth it is for benchmarks that do mmap/munmap in a loop and then
only map a single page. Normal loads don't tend to have those kinds of
"minimal range is very different from whole range" cases. Do you have
a real case for why it does that minimal range thing?

Because if you don't have a real case, I'd really suggest you get rid
of all the arm64 games with start/end. That still leaves this one
patch the correct thing to do (because even without the start/end
games the "need_flush" flag goes with the range), but it makes the
second patch a complete non-issue.

If you *do* have a real case, I think you need to modify your second patch to:

 - move the arm start/end updates from tlb_flush/tlb_add_flush to asm-generic

 - make tlb_gather_mmu() initialize start/end to TASK_SIZE/0 the same
way your tlb_flush does (so that the subsequent min/max games work).

so that *everybody* does the start/end games. I'd be ok with that.
What I'm not ok with is arm64 using the generic TLB gather way in odd
ways that then breaks code and results in things like your 2/2 patch
that fixes ARM64 but breaks x86.
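
To make the two suggestions above concrete, here is a minimal user-space
sketch in C. It is not kernel code: struct toy_gather, the toy_* helpers and
the TASK_SIZE value are hypothetical. It contrasts the two initialisations
under the same min/max update: starting from TASK_SIZE/0 yields the minimal
range, whereas starting from the maximal range makes the update a no-op.

```c
#include <assert.h>

#define PAGE_SIZE 4096UL
#define TASK_SIZE 0x40000000UL  /* hypothetical user address-space limit */

/* Toy stand-in for the start/end fields of the gather structure. */
struct toy_gather {
	unsigned long start;
	unsigned long end;
};

/* Suggested initialisation: an empty range that min/max can grow. */
static inline void toy_init_empty(struct toy_gather *tlb)
{
	tlb->start = TASK_SIZE;
	tlb->end = 0;
}

/* Initialisation to the maximal range handed in by the caller. */
static inline void toy_init_maximal(struct toy_gather *tlb,
				    unsigned long start, unsigned long end)
{
	assert(start <= end);
	tlb->start = start;
	tlb->end = end;
}

/* The min/max update applied for every page added to the gather. */
static inline void toy_add_page(struct toy_gather *tlb, unsigned long addr)
{
	if (addr < tlb->start)
		tlb->start = addr;
	if (addr + PAGE_SIZE > tlb->end)
		tlb->end = addr + PAGE_SIZE;
}
```

With toy_init_empty(), adding the page at 0x5000 gives [0x5000, 0x6000).
With toy_init_maximal(&tlb, 0, 0x100000), the same call leaves the range at
[0, 0x100000), so nothing is narrowed.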

Hmm? Is there something I'm missing?

                         Linus

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-28 15:30   ` Linus Torvalds
@ 2014-10-28 16:07     ` Will Deacon
  2014-10-28 16:25       ` Linus Torvalds
  2014-10-28 21:40     ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Will Deacon @ 2014-10-28 16:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

Hi Linus,

On Tue, Oct 28, 2014 at 03:30:43PM +0000, Linus Torvalds wrote:
> On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote:
> > This patch fixes the problem by incrementing addr by the PAGE_SIZE
> > before breaking out of the loop on batch failure.
> 
> This patch seems harmless and right, unlike the other one.

Yes; the other patch was more of a discussion point, as I was really
struggling to solve the problem without changing arch code. It was also
broken, as you discovered.

> I'd be ok with changing the *generic* code to do the "update start/end
> pointers in the mmu_gather structure", but then it really has to be
> done in *generic* code. Move your arm64 start/end updates to
> include/asm-generic/tlb.h, and change the initialization of start/end
> entirely. Right now we initialize those things to the maximal range
> (see tlb_gather_mmu()), and the arm64 logic seems to be to initialize
> them to TASK_SIZE/0 respectively and then update start/end as you add
> pages, so that you get the minimal range.
> 
> But because of this arm64 confusion (the "minimal range" really is
> *not* how this is designed to work), the current existing
> tlb_gather_mmu() does the wrong initialization.
> 
> In other words: my argument is that right now the arm64 code is just
> *wrong*. I'd be ok with making it the right thing to do, but if so, it
> needs to be made generic.

Ok, that's useful, thanks. Out of curiosity, what *is* the current intention
of __tlb_remove_tlb_entry, if start/end shouldn't be touched by
architectures? Is it just for the PPC hash thing?

> Are there any actual advantages to teh whole "minimal range" model? It
> adds overhead and complexity, and the only case where it would seem to
> be worth it is for benchmarks that do mmap/munmap in a loop and then
> only map a single page. Normal loads don't tend to have those kinds of
> "minimal range is very different from whole range" cases. Do you have
> a real case for why it does that minimal range thing.

I was certainly seeing this issue trigger regularly when running firefox,
but I'll need to dig and find out the differences in range size.

> Because if you don't have a real case, I'd really suggest you get rid
> of all the arm64 games with start/end. That still leaves this one
> patch the correct thing to do (because even without the start/end
> games the "need_flush" flag goes with the range, but it makes the
> second patch a complete non-issue.

Since we have hardware broadcasting of TLB invalidations on ARM, it is
in our interest to keep the number of outstanding operations as small as
possible, particularly on large systems where we don't get the targeted
shootdown with a single message that you can perform using IPIs (i.e.
you can only broadcast to all or no CPUs, and that happens for each pte).

Now, what I'd actually like to do is to keep track of discrete ranges,
so that we can avoid sending a TLB invalidation for each PAGE_SIZE region
of a huge-page mapping. That information is lost with our current minimal
range scheme and it was whilst I was thinking about this that I noticed
we were getting passed negative ranges with the current implementation.
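
A small sketch of the arithmetic behind the huge-page point, assuming 4 KiB
base pages and a 2 MiB huge page (both sizes are assumptions for this
example, as is the helper name toy_invalidation_count):

```c
#include <assert.h>

#define TOY_PAGE_SIZE	(4UL << 10)	/* assumed base page size: 4 KiB */
#define TOY_HPAGE_SIZE	(2UL << 20)	/* assumed huge page size: 2 MiB */

/*
 * Number of per-entry invalidations needed to cover [start, end) when
 * one invalidation is issued per stride-sized region (rounded up).
 */
static inline unsigned long toy_invalidation_count(unsigned long start,
						   unsigned long end,
						   unsigned long stride)
{
	assert(stride != 0);
	assert(start <= end);
	return (end - start + stride - 1) / stride;
}
```

Under these assumptions, covering one huge page with a TOY_PAGE_SIZE stride
takes 512 invalidations, while a TOY_HPAGE_SIZE stride takes one. Knowing
that a range is backed by a huge page is what allows the larger stride.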

> If you *do* have a real case, I think you need to modify your second patch to:
> 
>  - move the arm start/end updates from tlb_flush/tlb_add_flush to asm-generic
> 
>  - make tlb_gather_mmu() initialize start/end to TASK_SIZE/0 the same
> way your tlb_flush does (so that the subsequent min/max games work).
> 
> so that *everybody* does the start/end games. I'd be ok with that.

I'll try and come up with something which addresses the above, and we can
go from there.

> What I'm not ok with is arm64 using the generic TLB gather way in odd
> ways that then breaks code and results in things like your 2/2 patch
> that fixes ARM64 but breaks x86.

Understood, that certainly wasn't my intention.

Cheers,

Will

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-28 16:07     ` Will Deacon
@ 2014-10-28 16:25       ` Linus Torvalds
  2014-10-28 17:07         ` Will Deacon
  2014-10-28 21:16         ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2014-10-28 16:25 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Oct 28, 2014 at 9:07 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> Ok, that's useful, thanks. Out of curiosity, what *is* the current intention
> of __tlb_remove_tlb_entry, if start/end shouldn't be touched by
> architectures? Is it just for the PPC hash thing?

I think it's both the PPC hash, and for "legacy reasons" (ie
architectures that don't use the generic code, and were converted from
the "invalidate as you walk the tables" without ever really fixing the
"you have to flush the TLB before you free the page, and do
batching").

It would be lovely if we could just drop it entirely, although
changing it to actively do the minimal range is fine too.

> I was certainly seeing this issue trigger regularly when running firefox,
> but I'll need to dig and find out the differences in range size.

I'm wondering whether that was perhaps because of the mix-up with
initialization of the range. Afaik, that would always break your
min/max thing for the first batch (and since the batches are fairly
large, "first" may be "only")

But hey, it's possible that firefox does some big mappings but only
populates the beginning. Most architectures don't tend to have
excessive glass jaws in this area: invalidating things page-by-page is
invariably so slow that at some point you just go "just do the whole
range".

> Since we have hardware broadcasting of TLB invalidations on ARM, it is
> in our interest to keep the number of outstanding operations as small as
> possible, particularly on large systems where we don't get the targetted
> shootdown with a single message that you can perform using IPIs (i.e.
> you can only broadcast to all or no CPUs, and that happens for each pte).

Do you seriously *have* to broadcast for each pte?

Because that is quite frankly moronic.  We batch things up in software
for a real good reason: doing things one entry at a time just cannot
ever scale. At some point (and that point is usually not even very far
away), it's much better to do a single invalidate over a range. The
cost of having to refill the TLB's is *much* smaller than the cost of
doing tons of cross-CPU invalidates.

That's true even for the cases where we track the CPU's involved in
that mapping, and only invalidate a small subset. With a "all CPU's
broadcast", the cross-over point must be even smaller. Doing thousands
of CPU broadcasts is just crazy, even if they are hw-accelerated.

Can't you just do a full invalidate and a SW IPI for larger ranges?

And as mentioned, true sparse mappings are actually fairly rare, so
making extra effort (and data structures) to have individual ranges
sounds crazy.

Is this some hw-enforced thing? You really can't turn off the
cross-cpu-for-each-pte braindamage?

                         Linus

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-28 16:25       ` Linus Torvalds
@ 2014-10-28 17:07         ` Will Deacon
  2014-10-28 18:03           ` Linus Torvalds
  2014-10-28 21:16         ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 27+ messages in thread
From: Will Deacon @ 2014-10-28 17:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Oct 28, 2014 at 04:25:35PM +0000, Linus Torvalds wrote:
> On Tue, Oct 28, 2014 at 9:07 AM, Will Deacon <will.deacon@arm.com> wrote:
> > I was certainly seeing this issue trigger regularly when running firefox,
> > but I'll need to dig and find out the differences in range size.
> 
> I'm wondering whether that was perhaps because of the mix-up with
> initialization of the range. Afaik, that would always break your
> min/max thing for the first batch (and since the batches are fairly
> large, "first" may be "only")
> 
> But hey. it's possible that firefox does some big mappings but only
> populates the beginning. Most architectures don't tend to have
> excessive glass jaws in this area: invalidating things page-by-page is
> invariably so slow that at some point you just go "just do the whole
> range".
> 
> > Since we have hardware broadcasting of TLB invalidations on ARM, it is
> > in our interest to keep the number of outstanding operations as small as
> > possible, particularly on large systems where we don't get the targetted
> > shootdown with a single message that you can perform using IPIs (i.e.
> > you can only broadcast to all or no CPUs, and that happens for each pte).
> 
> Do you seriously *have* to broadcast for each pte?
> 
> Because that is quite frankly moronic.  We batch things up in software
> for a real good reason: doing things one entry at a time just cannot
> ever scale. At some point (and that point is usually not even very far
> away), it's much better to do a single invalidate over a range. The
> cost of having to refill the TLB's is *much* smaller than the cost of
> doing tons of cross-CPU invalidates.

I don't think that's necessarily true, at least not on the systems I'm
familiar with. A table walk can be comparatively expensive, particularly
when virtualisation is involved and the depth of the host and guest page
tables starts to grow -- we're talking >20 memory accesses per walk. By
contrast, the TLB invalidation messages are asynchronous and carried on
the interconnect (a DSB instruction is used to synchronise the updates).
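
One common way to arrive at a figure like this is the worst-case cost of a
two-stage (nested) walk with no walk caching, where every guest table
pointer and the final guest physical address each need a full host walk.
The helper below is a sketch of that arithmetic only; the function name and
the four-level configuration used afterwards are assumptions, not a
description of any particular system.

```c
/*
 * Worst-case memory accesses for one nested translation, assuming no
 * walk caches: guest_levels guest table reads, plus a full host walk
 * for each guest table pointer and for the final guest physical address.
 */
static inline unsigned int nested_walk_accesses(unsigned int guest_levels,
						unsigned int host_levels)
{
	return (guest_levels + 1) * host_levels + guest_levels;
}
```

With four guest levels and four host levels this gives 24 accesses, and with
host_levels set to 0 (no virtualisation) it reduces to the plain four-access
walk.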

> That's true even for the cases where we track the CPU's involved in
> that mapping, and only invalidate a small subset. With a "all CPU's
> broadcast", the cross-over point must be even smaller. Doing thousands
> of CPU broadcasts is just crazy, even if they are hw-accelerated.
> 
> Can't you just do a full invalidate and a SW IPI for larger ranges?

We already do that, but it's mainly there to catch *really* large ranges
(like the negative ones...), which can trigger the soft lockup detector.
The cases we've seen for this so far have been bugs (e.g. this thread and
also a related issue where we try to flush the whole of vmalloc space).
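
As a hedged sketch of the kind of fallback being discussed (not the arm64
implementation; the threshold, enum and function name are hypothetical), a
range-size check can pick between per-page invalidation and a full
invalidate. Because the arithmetic is unsigned, an inverted range wraps to a
huge page count and ends up on the full-invalidate path.

```c
#define PAGE_SIZE 4096UL
#define TOY_MAX_PER_PAGE_OPS 1024UL	/* hypothetical cut-over point */

enum toy_flush {
	TOY_FLUSH_NONE,
	TOY_FLUSH_PER_PAGE,
	TOY_FLUSH_ALL,
};

/* Pick a flush strategy for [start, end) based on its size in pages. */
static inline enum toy_flush toy_pick_flush(unsigned long start,
					    unsigned long end)
{
	unsigned long pages;

	if (start == end)
		return TOY_FLUSH_NONE;

	/* Unsigned subtraction: start > end wraps to a very large value. */
	pages = (end - start) / PAGE_SIZE;

	return pages > TOY_MAX_PER_PAGE_OPS ? TOY_FLUSH_ALL : TOY_FLUSH_PER_PAGE;
}
```

For example, toy_pick_flush(0x1000, 0x3000) selects the per-page path, while
an inverted range such as toy_pick_flush(0x20000, 0x11000) selects the full
invalidate.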

> And as mentioned, true sparse mappings are actually fairly rare, so
> making extra effort (and data structures) to have individual ranges
> sounds crazy.

Sure, I'll try and get some data on this. I'd like to resolve the THP case,
at least, which means keeping track of calls to __tlb_remove_pmd_tlb_entry.

> Is this some hw-enforced thing? You really can't turn off the
> cross-cpu-for-each-pte braindamage?

We could use IPIs if we wanted to and issue local TLB invalidations on
the targeted cores, but I'd be surprised if this showed an improvement
on ARM-based systems.

Will

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-28 17:07         ` Will Deacon
@ 2014-10-28 18:03           ` Linus Torvalds
  0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2014-10-28 18:03 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Oct 28, 2014 at 10:07 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> I don't think that's necessarily true, at least not on the systems I'm
> familiar with. A table walk can be comparatively expensive, particularly
> when virtualisation is involved and the depth of the host and guest page
> tables starts to grow -- we're talking >20 memory accesses per walk. By
> contrast, the TLB invalidation messages are asynchronous and carried on
> the interconnect (a DSB instruction is used to synchronise the updates).

">20 memory accesses per *walk*"? Isn't the ARM a regular table? So
once you've gone down to the pte level, it's just an array, regardless
of how many levels there are.

But I guess there are no actual multi-socket ARM's around in real
life, so you probably don't see the real scaling costs. Within a die,
you're probably right that the overhead is negligible.

                   Linus

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-28 16:25       ` Linus Torvalds
  2014-10-28 17:07         ` Will Deacon
@ 2014-10-28 21:16         ` Benjamin Herrenschmidt
  2014-10-28 21:32           ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2014-10-28 21:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux

On Tue, 2014-10-28 at 09:25 -0700, Linus Torvalds wrote:

> > Since we have hardware broadcasting of TLB invalidations on ARM, it is
> > in our interest to keep the number of outstanding operations as small as
> > possible, particularly on large systems where we don't get the targetted
> > shootdown with a single message that you can perform using IPIs (i.e.
> > you can only broadcast to all or no CPUs, and that happens for each pte).
> 
> Do you seriously *have* to broadcast for each pte?

We do too, in current CPUs at least, it's sad ...

> Because that is quite frankly moronic.  We batch things up in software
> for a real good reason: doing things one entry at a time just cannot
> ever scale. At some point (and that point is usually not even very far
> away), it's much better to do a single invalidate over a range. The
> cost of having to refill the TLB's is *much* smaller than the cost of
> doing tons of cross-CPU invalidates.
> 
> That's true even for the cases where we track the CPU's involved in
> that mapping, and only invalidate a small subset. With a "all CPU's
> broadcast", the cross-over point must be even smaller. Doing thousands
> of CPU broadcasts is just crazy, even if they are hw-accelerated.
> 
> Can't you just do a full invalidate and a SW IPI for larger ranges?

For us, this would be great except ... we can potentially have other
agents with an MMU that only support snooping of the broadcasts...

> And as mentioned, true sparse mappings are actually fairly rare, so
> making extra effort (and data structures) to have individual ranges
> sounds crazy.
> 
> Is this some hw-enforced thing? You really can't turn off the
> cross-cpu-for-each-pte braindamage?

Cheers,
Ben.

>                          Linus
* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-28 21:16         ` Benjamin Herrenschmidt
@ 2014-10-28 21:32           ` Linus Torvalds
  0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2014-10-28 21:32 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux

On Tue, Oct 28, 2014 at 2:16 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>>
>> Can't you just do a full invalidate and a SW IPI for larger ranges?
>
> For us, this would be great except ... we can potentially have other
> agents with an MMU that only support snooping of the broadcasts...

Ugh. Oh well. I guess on power you need to walk all the hashed entries
individually _anyway_, so there's no way you could really use a
range-based or "invalidate all" model to avoid some of the work.

                  Linus

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-28 15:30   ` Linus Torvalds
  2014-10-28 16:07     ` Will Deacon
@ 2014-10-28 21:40     ` Linus Torvalds
  2014-10-29 19:47       ` Will Deacon
  1 sibling, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2014-10-28 21:40 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Oct 28, 2014 at 8:30 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote:
>>
>> This patch fixes the problem by incrementing addr by the PAGE_SIZE
>> before breaking out of the loop on batch failure.
>
> This patch seems harmless and right [..]

I've applied it (commit ce9ec37bddb6), and marked it for stable.

I think that bug has been around since at least commit 2b047252d087
("Fix TLB gather virtual address range invalidation corner cases")
which went into 3.11, but that commit was in turn also marked for
stable, so I'm not sure just how far back this fix needs to go. I
suspect the simple answer is "as far back as it applies" ;)

I'll wait and see what you'll do about the other patch.

                 Linus

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-28 21:40     ` Linus Torvalds
@ 2014-10-29 19:47       ` Will Deacon
  2014-10-29 21:11         ` Linus Torvalds
  2014-11-04 14:29         ` Catalin Marinas
  0 siblings, 2 replies; 27+ messages in thread
From: Will Deacon @ 2014-10-29 19:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Oct 28, 2014 at 09:40:35PM +0000, Linus Torvalds wrote:
> On Tue, Oct 28, 2014 at 8:30 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote:
> >>
> >> This patch fixes the problem by incrementing addr by the PAGE_SIZE
> >> before breaking out of the loop on batch failure.
> >
> > This patch seems harmless and right [..]
> 
> I've applied it (commit ce9ec37bddb6), and marked it for stable.
> 
> I think that bug has been around since at least commit 2b047252d087
> ("Fix TLB gather virtual address range invalidation corner cases")
> which went into 3.11, but that has in turn then was also marked for
> stable, so I'm not sure just how far back this fix needs to go. I
> suspect the simple answer is "as far back as it applies" ;)

Thanks for that.

> I'll wait and see what you'll do about the other patch.

I've cooked up something (see below), but it unfortunately makes a couple
of minor changes to powerpc and microblaze to address redefinitions of
some of the gather callbacks (tlb_{start,end}_vma, __tlb_remove_tlb_entry).

On the plus side, it tidies up the force_flush case in zap_pte_range
quite nicely (assuming I didn't screw it up again).

Cheers,

Will

--->8

commit f51dd616639dfbfe0685c82b47e0f31e4a34f16b
Author: Will Deacon <will.deacon@arm.com>
Date:   Wed Oct 29 10:03:09 2014 +0000

    mmu_gather: move minimal range calculations into generic code
    
    On architectures with hardware broadcasting of TLB invalidation messages,
    it makes sense to reduce the range of the mmu_gather structure when
    unmapping page ranges based on the dirty address information passed to
    tlb_remove_tlb_entry.
    
    arm64 already does this by directly manipulating the start/end fields
    of the gather structure, but this confuses the generic code which
    does not expect these fields to change and can end up calculating
    invalid, negative ranges when forcing a flush in zap_pte_range.
    
    This patch moves the minimal range calculation out of the arm64 code
    and into the generic implementation, simplifying zap_pte_range in the
    process (which no longer needs to care about start/end, since they will
    point to the appropriate ranges already).
    
    Signed-off-by: Will Deacon <will.deacon@arm.com>

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index a82c0c5c8b52..a9c9df0f60ff 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -19,10 +19,6 @@
 #ifndef __ASM_TLB_H
 #define __ASM_TLB_H
 
-#define  __tlb_remove_pmd_tlb_entry __tlb_remove_pmd_tlb_entry
-
-#include <asm-generic/tlb.h>
-
 #include <linux/pagemap.h>
 #include <linux/swap.h>
 
@@ -37,16 +33,8 @@ static inline void __tlb_remove_table(void *_table)
 #define tlb_remove_entry(tlb, entry)	tlb_remove_page(tlb, entry)
 #endif /* CONFIG_HAVE_RCU_TABLE_FREE */
 
-/*
- * There's three ways the TLB shootdown code is used:
- *  1. Unmapping a range of vmas.  See zap_page_range(), unmap_region().
- *     tlb->fullmm = 0, and tlb_start_vma/tlb_end_vma will be called.
- *  2. Unmapping all vmas.  See exit_mmap().
- *     tlb->fullmm = 1, and tlb_start_vma/tlb_end_vma will be called.
- *     Page tables will be freed.
- *  3. Unmapping argument pages.  See shift_arg_pages().
- *     tlb->fullmm = 0, but tlb_start_vma/tlb_end_vma will not be called.
- */
+#include <asm-generic/tlb.h>
+
 static inline void tlb_flush(struct mmu_gather *tlb)
 {
 	if (tlb->fullmm) {
@@ -54,54 +42,13 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 	} else if (tlb->end > 0) {
 		struct vm_area_struct vma = { .vm_mm = tlb->mm, };
 		flush_tlb_range(&vma, tlb->start, tlb->end);
-		tlb->start = TASK_SIZE;
-		tlb->end = 0;
-	}
-}
-
-static inline void tlb_add_flush(struct mmu_gather *tlb, unsigned long addr)
-{
-	if (!tlb->fullmm) {
-		tlb->start = min(tlb->start, addr);
-		tlb->end = max(tlb->end, addr + PAGE_SIZE);
-	}
-}
-
-/*
- * Memorize the range for the TLB flush.
- */
-static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
-					  unsigned long addr)
-{
-	tlb_add_flush(tlb, addr);
-}
-
-/*
- * In the case of tlb vma handling, we can optimise these away in the
- * case where we're doing a full MM flush.  When we're doing a munmap,
- * the vmas are adjusted to only cover the region to be torn down.
- */
-static inline void tlb_start_vma(struct mmu_gather *tlb,
-				 struct vm_area_struct *vma)
-{
-	if (!tlb->fullmm) {
-		tlb->start = TASK_SIZE;
-		tlb->end = 0;
 	}
 }
 
-static inline void tlb_end_vma(struct mmu_gather *tlb,
-			       struct vm_area_struct *vma)
-{
-	if (!tlb->fullmm)
-		tlb_flush(tlb);
-}
-
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 				  unsigned long addr)
 {
 	pgtable_page_dtor(pte);
-	tlb_add_flush(tlb, addr);
 	tlb_remove_entry(tlb, pte);
 }
 
@@ -109,7 +56,6 @@ static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 				  unsigned long addr)
 {
-	tlb_add_flush(tlb, addr);
 	tlb_remove_entry(tlb, virt_to_page(pmdp));
 }
 #endif
@@ -118,15 +64,8 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
 static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
 				  unsigned long addr)
 {
-	tlb_add_flush(tlb, addr);
 	tlb_remove_entry(tlb, virt_to_page(pudp));
 }
 #endif
 
-static inline void __tlb_remove_pmd_tlb_entry(struct mmu_gather *tlb, pmd_t *pmdp,
-						unsigned long address)
-{
-	tlb_add_flush(tlb, address);
-}
-
 #endif
diff --git a/arch/microblaze/include/asm/tlb.h b/arch/microblaze/include/asm/tlb.h
index 8aa97817cc8c..99b6ded54849 100644
--- a/arch/microblaze/include/asm/tlb.h
+++ b/arch/microblaze/include/asm/tlb.h
@@ -14,7 +14,6 @@
 #define tlb_flush(tlb)	flush_tlb_mm((tlb)->mm)
 
 #include <linux/pagemap.h>
-#include <asm-generic/tlb.h>
 
 #ifdef CONFIG_MMU
 #define tlb_start_vma(tlb, vma)		do { } while (0)
@@ -22,4 +21,6 @@
 #define __tlb_remove_tlb_entry(tlb, pte, address) do { } while (0)
 #endif
 
+#include <asm-generic/tlb.h>
+
 #endif /* _ASM_MICROBLAZE_TLB_H */
diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index e9a9f60e596d..fc3ee06eab87 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -3,7 +3,6 @@
 #ifdef __KERNEL__
 
 #include <linux/mm.h>
-#include <asm-generic/tlb.h>
 
 #ifdef CONFIG_PPC_BOOK3E
 extern void tlb_flush_pgtable(struct mmu_gather *tlb, unsigned long address);
@@ -14,6 +13,8 @@ static inline void tlb_flush_pgtable(struct mmu_gather *tlb,
 }
 #endif /* !CONFIG_PPC_BOOK3E */
 
+extern void tlb_remove_table(struct mmu_gather *tlb, void *table);
+
 #ifdef CONFIG_PPC64
 #include <asm/pgalloc-64.h>
 #else
diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index e2b428b0f7ba..20733fa518ae 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -27,6 +27,7 @@
 
 #define tlb_start_vma(tlb, vma)	do { } while (0)
 #define tlb_end_vma(tlb, vma)	do { } while (0)
+#define __tlb_remove_tlb_entry	__tlb_remove_tlb_entry
 
 extern void tlb_flush(struct mmu_gather *tlb);
 
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 5672d7ea1fa0..340bc5c5ca2d 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -128,6 +128,46 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 		tlb_flush_mmu(tlb);
 }
 
+static inline void __tlb_adjust_range(struct mmu_gather *tlb,
+				      unsigned long address)
+{
+	if (!tlb->fullmm) {
+		tlb->start = min(tlb->start, address);
+		tlb->end = max(tlb->end, address + PAGE_SIZE);
+	}
+}
+
+static inline void __tlb_reset_range(struct mmu_gather *tlb)
+{
+	tlb->start = TASK_SIZE;
+	tlb->end = 0;
+}
+
+/*
+ * In the case of tlb vma handling, we can optimise these away in the
+ * case where we're doing a full MM flush.  When we're doing a munmap,
+ * the vmas are adjusted to only cover the region to be torn down.
+ */
+#ifndef tlb_start_vma
+#define tlb_start_vma(tlb, vma) do { } while (0)
+#endif
+
+#define __tlb_end_vma(tlb, vma)					\
+	do {							\
+		if (!tlb->fullmm) {				\
+			tlb_flush(tlb);				\
+			__tlb_reset_range(tlb);			\
+		}						\
+	} while (0)
+
+#ifndef tlb_end_vma
+#define tlb_end_vma	__tlb_end_vma
+#endif
+
+#ifndef __tlb_remove_tlb_entry
+#define __tlb_remove_tlb_entry(tlb, ptep, address) do { } while (0)
+#endif
+
 /**
  * tlb_remove_tlb_entry - remember a pte unmapping for later tlb invalidation.
  *
@@ -138,6 +178,7 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 #define tlb_remove_tlb_entry(tlb, ptep, address)		\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
 
@@ -152,12 +193,14 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)		\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);	\
 	} while (0)
 
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
 
@@ -165,6 +208,7 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 #define pud_free_tlb(tlb, pudp, address)			\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__pud_free_tlb(tlb, pudp, address);		\
 	} while (0)
 #endif
@@ -172,6 +216,7 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 #define pmd_free_tlb(tlb, pmdp, address)			\
 	do {							\
 		tlb->need_flush = 1;				\
+		__tlb_adjust_range(tlb, address);		\
 		__pmd_free_tlb(tlb, pmdp, address);		\
 	} while (0)
 
diff --git a/mm/memory.c b/mm/memory.c
index 3e503831e042..0bc940e41ec9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -220,8 +220,6 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 	/* Is it from 0 to ~0? */
 	tlb->fullmm     = !(start | (end+1));
 	tlb->need_flush_all = 0;
-	tlb->start	= start;
-	tlb->end	= end;
 	tlb->need_flush = 0;
 	tlb->local.next = NULL;
 	tlb->local.nr   = 0;
@@ -232,6 +230,8 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb->batch = NULL;
 #endif
+
+	__tlb_reset_range(tlb);
 }
 
 static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
@@ -241,6 +241,7 @@ static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
 #endif
+	__tlb_reset_range(tlb);
 }
 
 static void tlb_flush_mmu_free(struct mmu_gather *tlb)
@@ -1186,20 +1187,8 @@ again:
 	arch_leave_lazy_mmu_mode();
 
 	/* Do the actual TLB flush before dropping ptl */
-	if (force_flush) {
-		unsigned long old_end;
-
-		/*
-		 * Flush the TLB just for the previous segment,
-		 * then update the range to be the remaining
-		 * TLB range.
-		 */
-		old_end = tlb->end;
-		tlb->end = addr;
+	if (force_flush)
 		tlb_flush_mmu_tlbonly(tlb);
-		tlb->start = addr;
-		tlb->end = old_end;
-	}
 	pte_unmap_unlock(start_pte, ptl);
 
 	/*

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-29 19:47       ` Will Deacon
@ 2014-10-29 21:11         ` Linus Torvalds
  2014-10-29 21:27           ` Benjamin Herrenschmidt
  2014-11-04 14:29         ` Catalin Marinas
  1 sibling, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2014-10-29 21:11 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Wed, Oct 29, 2014 at 12:47 PM, Will Deacon <will.deacon@arm.com> wrote:
>
> I've cooked up something (see below), but it unfortunately makes a couple
> of minor changes to powerpc and microblaze to address redefinitions of
> some of the gather callbacks (tlb_{start,end}_vma, __tlb_remove_tlb_entry).
>
> On the plus side, it tidies up the force_flush case in zap_pte_range
> quite nicely (assuming I didn't screw it up again).

Yes, this looks fine to me. Looks like a good cleanup, and moves more
code to generic headers rather than having arch-specific stuff.

Ben, can you check that this is ok on powerpc? Who else should
double-check this due to having been involved in tlb flushing? But I
think this is good to go, modulo checking other architectures for
sanity.

                    Linus

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-29 21:11         ` Linus Torvalds
@ 2014-10-29 21:27           ` Benjamin Herrenschmidt
  2014-11-01 17:01             ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2014-10-29 21:27 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux

On Wed, 2014-10-29 at 14:11 -0700, Linus Torvalds wrote:
> On Wed, Oct 29, 2014 at 12:47 PM, Will Deacon <will.deacon@arm.com> wrote:
> >
> > I've cooked up something (see below), but it unfortunately makes a couple
> > of minor changes to powerpc and microblaze to address redefinitions of
> > some of the gather callbacks (tlb_{start,end}_vma, __tlb_remove_tlb_entry).
> >
> > On the plus side, it tidies up the force_flush case in zap_pte_range
> > quite nicely (assuming I didn't screw it up again).
> 
> Yes, this looks fine to me. Looks like a good cleanup, and moves more
> code to generic headers rather than having arch-specific stuff.
> 
> Ben, can you check that this is ok on powerpc? Who else should
> double-check this due to having been involved in tlb flushing? But I
> think this is good to go, modulo checking other architectures for
> sanity.

TLB flushing is only me I think, I'll engage my brain after breakfast
and see if it is all good. The difficulty for us is that SW loaded TLB,
hash32 and hash64 are all very different and my brain's been good at
swapping a lot of that stuff out lately...

Cheers,
Ben.



* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-29 21:27           ` Benjamin Herrenschmidt
@ 2014-11-01 17:01             ` Linus Torvalds
  2014-11-01 20:25               ` Benjamin Herrenschmidt
  2014-11-03 17:56               ` Will Deacon
  0 siblings, 2 replies; 27+ messages in thread
From: Linus Torvalds @ 2014-11-01 17:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux

On Wed, Oct 29, 2014 at 2:27 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
>
> TLB flushing is only me I think, I'll engage my brain after breakfast
> and see if it is all good

Ping? Breakfast is either long over, or you're starting to look a bit
like Mr Creosote...

Anyway, Will, I assume this is not a correctness issue for you, just
an annoying performance issue. Right? Or is there actually some issue
with the actual range not being set to be sufficiently large?

Also, it strikes me that I *think* that you might be able to extend
your patch to remove the whole "need_flush" field, since as far as I
can tell, "tlb->need_flush" is now equivalent to "tlb->start <
tlb->end". Of course, as long as we still require that
"need_flush_all", that doesn't actually save us any space, so maybe
it's not worth changing.
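
A minimal userspace sketch of that equivalence, assuming simplified
stand-ins for the kernel structures (every sketch_ name and constant
below is hypothetical, not kernel API). It ignores the fullmm case, where
the patch skips the range adjustment, so the two conditions would differ
there:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for PAGE_SIZE and TASK_SIZE, illustration only. */
#define SKETCH_PAGE_SIZE 4096UL
#define SKETCH_TASK_SIZE 0xC0000000UL

struct sketch_gather {
	unsigned long start;
	unsigned long end;
	bool need_flush;
};

/* Mirrors __tlb_reset_range(): an empty range has start > end. */
static void sketch_reset_range(struct sketch_gather *tlb)
{
	tlb->start = SKETCH_TASK_SIZE;
	tlb->end = 0;
	tlb->need_flush = false;
}

/* Mirrors tlb_remove_tlb_entry(): set the flag and grow the range. */
static void sketch_remove_entry(struct sketch_gather *tlb,
				unsigned long address)
{
	assert(address < SKETCH_TASK_SIZE);
	tlb->need_flush = true;
	if (address < tlb->start)
		tlb->start = address;
	if (address + SKETCH_PAGE_SIZE > tlb->end)
		tlb->end = address + SKETCH_PAGE_SIZE;
}

/* Candidate replacement for testing the need_flush field. */
static bool sketch_range_pending(const struct sketch_gather *tlb)
{
	return tlb->start < tlb->end;
}
```

In this sketch sketch_range_pending() and need_flush agree after every
reset and every removal, which is the property the suggestion relies on.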

                       Linus

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-11-01 17:01             ` Linus Torvalds
@ 2014-11-01 20:25               ` Benjamin Herrenschmidt
  2014-11-03 17:56               ` Will Deacon
  1 sibling, 0 replies; 27+ messages in thread
From: Benjamin Herrenschmidt @ 2014-11-01 20:25 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux

On Sat, 2014-11-01 at 10:01 -0700, Linus Torvalds wrote:
> On Wed, Oct 29, 2014 at 2:27 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > TLB flushing is only me I think, I'll engage my brain after breakfast
> > and see if it is all good
> 
> Ping? Breakfast is either long over, or you're starting to look a bit
> like Mr Creosote...

Argh... dropped that ball.

> Anyway, Will, I assume this is not a correctness issue for you, just
> an annoying performance issue. Right? Or is there actually some issue
> with the actual range not being set to be sufficiently large?

It should be fine for us in terms of correctness I think. We rely on the
lazy mmu bits for batching/flushing on hash64, we use
__tlb_remove_tlb_entry() for immediate flush on hash32 and the SW loaded
TLB cases are pretty dumb here and should be generally unaffected.

> Also, it strikes me that I *think* that you might be able to extend
> your patch to remove the whole "need_flush" field, since as far as I
> can tell, "tlb->need_flush" is now equivalent to "tlb->start <
> tlb->end". Of course, as long as we still require that
> "need_flush_all", that doesn't actually save us any space, so maybe
> it's not worth changing.

Cheers,
Ben.

>                        Linus



* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-11-01 17:01             ` Linus Torvalds
  2014-11-01 20:25               ` Benjamin Herrenschmidt
@ 2014-11-03 17:56               ` Will Deacon
  2014-11-03 18:05                 ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Will Deacon @ 2014-11-03 17:56 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Benjamin Herrenschmidt, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux

On Sat, Nov 01, 2014 at 05:01:30PM +0000, Linus Torvalds wrote:
> On Wed, Oct 29, 2014 at 2:27 PM, Benjamin Herrenschmidt
> <benh@kernel.crashing.org> wrote:
> >
> > TLB flushing is only me I think, I'll engage my brain after breakfast
> > and see if it is all good
> 
> Ping? Breakfast is either long over, or you're starting to look a bit
> like Mr Creosote...

Wafer thin patch?

> Anyway, Will, I assume this is not a correctness issue for you, just
> an annoying performance issue. Right? Or is there actually some issue
> with the actual range not being set to be sufficiently large?

Yeah, it's just a performance issue. For ranges over 1k pages, we end up
flushing the whole TLB.

> Also, it strikes me that I *think* that you might be able to extend
> your patch to remove the whole "need_flush" field, since as far as I
> can tell, "tlb->need_flush" is now equivalent to "tlb->start <
> tlb->end". Of course, as long as we still require that
> "need_flush_all", that doesn't actually save us any space, so maybe
> it's not worth changing.

We use `tlb->end > 0' in the arm64 backend, so I think you're right. I'll
take a look in the name of cleanup and this can wait until 3.19.

Will

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-11-03 17:56               ` Will Deacon
@ 2014-11-03 18:05                 ` Linus Torvalds
  0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2014-11-03 18:05 UTC (permalink / raw)
  To: Will Deacon
  Cc: Benjamin Herrenschmidt, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux

On Mon, Nov 3, 2014 at 9:56 AM, Will Deacon <will.deacon@arm.com> wrote:
>
> We use `tlb->end > 0' in the arm64 backend, so I think you're right. I'll
> take a look in the name of cleanup and this can wait until 3.19.

Ok, I'll just archive this thread for now, and expect a patch at some
future date. And as far as I'm concerned, it's ok if it just comes in
through the normal arm64 tree, with just a note in the pull request
reminding me about why it also touches some other architectures.

Thanks,

                     Linus

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-10-29 19:47       ` Will Deacon
  2014-10-29 21:11         ` Linus Torvalds
@ 2014-11-04 14:29         ` Catalin Marinas
  2014-11-04 16:08           ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Catalin Marinas @ 2014-11-04 14:29 UTC (permalink / raw)
  To: Will Deacon
  Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

(catching up with email after holiday)

On Wed, Oct 29, 2014 at 07:47:39PM +0000, Will Deacon wrote:
>     mmu_gather: move minimal range calculations into generic code
>     
>     On architectures with hardware broadcasting of TLB invalidation messages,
>     it makes sense to reduce the range of the mmu_gather structure when
>     unmapping page ranges based on the dirty address information passed to
>     tlb_remove_tlb_entry.
>     
>     arm64 already does this by directly manipulating the start/end fields
>     of the gather structure, but this confuses the generic code which
>     does not expect these fields to change and can end up calculating
>     invalid, negative ranges when forcing a flush in zap_pte_range.
>     
>     This patch moves the minimal range calculation out of the arm64 code
>     and into the generic implementation, simplifying zap_pte_range in the
>     process (which no longer needs to care about start/end, since they will
>     point to the appropriate ranges already).
>     
>     Signed-off-by: Will Deacon <will.deacon@arm.com>

Nice to see this clean-up for arm64, however I have a question below.

> diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
> index 5672d7ea1fa0..340bc5c5ca2d 100644
> --- a/include/asm-generic/tlb.h
> +++ b/include/asm-generic/tlb.h
> @@ -128,6 +128,46 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
>  		tlb_flush_mmu(tlb);
>  }
>  
> +static inline void __tlb_adjust_range(struct mmu_gather *tlb,
> +				      unsigned long address)
> +{
> +	if (!tlb->fullmm) {
> +		tlb->start = min(tlb->start, address);
> +		tlb->end = max(tlb->end, address + PAGE_SIZE);
> +	}
> +}

Here __tlb_adjust_range() assumes end to be (start + PAGE_SIZE) always.

> @@ -152,12 +193,14 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
>  #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)		\
>  	do {							\
>  		tlb->need_flush = 1;				\
> +		__tlb_adjust_range(tlb, address);		\
>  		__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);	\
>  	} while (0)

[...]

>  #define pmd_free_tlb(tlb, pmdp, address)			\
>  	do {							\
>  		tlb->need_flush = 1;				\
> +		__tlb_adjust_range(tlb, address);		\
>  		__pmd_free_tlb(tlb, pmdp, address);		\
>  	} while (0)

This would work on arm64 but is the PAGE_SIZE range enough for all
architectures even when we flush a huge page or a pmd/pud table entry?
The approach Peter Z took with his patches was to use pmd_addr_end(addr,
TASK_SIZE) and change __tlb_adjust_range() to take start/end arguments:

https://lkml.org/lkml/2011/3/7/302

-- 
Catalin

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-11-04 14:29         ` Catalin Marinas
@ 2014-11-04 16:08           ` Linus Torvalds
  2014-11-06 13:57             ` Catalin Marinas
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2014-11-04 16:08 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Nov 4, 2014 at 6:29 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> This would work on arm64 but is the PAGE_SIZE range enough for all
> architectures even when we flush a huge page or a pmd/pud table entry?

It pretty much had *better* be.

For things like page table caches (ie caching addresses "inside" the
page tables, like x86 does), for legacy reasons, flushing an
individual page had better flush the page table caches behind it. This
is definitely how x86 works, for example. And if you have an
architected non-legacy page table cache (which I'm not aware of
anybody actually doing), you're going to have some architecturally
explicit flushing for that, likely *separate* from a regular TLB entry
flush, and thus you'd need more than just some range expansion..

And the logic is very similar for things like hugepages. Either a
normal "TLB invalidate" instruction anywhere in the hugepage will
invalidate the whole hugepage, or you would have special instructions
or rules for invalidating hugepages and you'd need more than just some
range expansion.

So in neither case does it make sense to expand the range, afaik. And
it would hurt normal architectures. So if we ever find an architecture
that would want something that odd, I think it is up to that
architecture to do its own odd thing, not cause pain for others.

                          Linus

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-11-04 16:08           ` Linus Torvalds
@ 2014-11-06 13:57             ` Catalin Marinas
  2014-11-06 17:53               ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Catalin Marinas @ 2014-11-06 13:57 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Nov 04, 2014 at 04:08:27PM +0000, Linus Torvalds wrote:
> On Tue, Nov 4, 2014 at 6:29 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > This would work on arm64 but is the PAGE_SIZE range enough for all
> > architectures even when we flush a huge page or a pmd/pud table entry?
> 
> It pretty much had *better* be.

Thanks for confirming.

> For things like page table caches (ie caching addresses "inside" the
> page tables, like x86 does), for legacy reasons, flushing an
> individual page had better flush the page table caches behind it. This
> is definitely how x86 works, for example. And if you have an
> architected non-legacy page table cache (which I'm not aware of
> anybody actually doing), you're going to have some architecturally
> explicit flushing for that, likely *separate* from a regular TLB entry
> flush, and thus you'd need more than just some range expansion..

On arm64 we have two types of TLB invalidation instructions, the
standard one which flushes a pte entry together with the corresponding
upper level page table cache and a "leaf" operation only for the pte. We
don't use the latter in Linux (yet) but in theory it's more efficient.

Anyway, even without special "leaf" operations, it would be useful to
make the distinction between unmap_vmas() and free_pgtables() with
regards to the ranges tracked by mmu_gather. For the former, tlb_flush()
needs to flush the range in PAGE_SIZE increments (assuming a mix of
small and huge pages). For the latter, PMD_SIZE increments would be
enough.

With RCU_TABLE_FREE, I think checking tlb->local.next would do the trick
but for x86 we can keep mmu_gather.need_flush only for pte clearing
and remove need_flush = 1 from p*_free_tlb() functions. The arch
specific tlb_flush() can take need_flush into account to change the
range flushing increment or even ignore the second tlb_flush() triggered
by tlb_finish_mmu() (after free_pgtables(), the ptes have been flushed
via tlb_end_vma()).

-- 
Catalin

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-11-06 13:57             ` Catalin Marinas
@ 2014-11-06 17:53               ` Linus Torvalds
  2014-11-06 18:38                 ` Catalin Marinas
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2014-11-06 17:53 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Thu, Nov 6, 2014 at 5:57 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
>
> Anyway, even without special "leaf" operations, it would be useful to
> make the distinction between unmap_vmas() and free_pgtables() with
> regards to the ranges tracked by mmu_gather. For the former, tlb_flush()
> needs to flush the range in PAGE_SIZE increments (assuming a mix of
> small and huge pages). For the latter, PMD_SIZE increments would be
> enough.

Why would you *ever* care about the increments?

Quite frankly, I think even the PAGE_SIZE thing is just (a) stupid and
(b) misleading.

It might actually be better if, instead of

    tlb->end = max(tlb->end, address + PAGE_SIZE);

it were a simpler

    tlb->end = max(tlb->end, address+1);

(Even the "+1" is kind of unnecessary, but it's there to distinguish
the zero address from the initial condition).

The thing is, absolutely *no* architecture has a TLB flush that
involves giving the start and end of the page you want to flush. None.
Nada. Zilch. Trying to think of it as a "range of bytes I need to
flush" is wrong. And it's misleading, and it makes people think in
stupid ways like "I should change PAGE_SIZE to PMD_SIZE when I flush a
PMD". That's *wrong*.

Every single TLB flush is about a variation of "flush the TLB entry
that contains the mapping for this address". The whole "add page-size"
is pointless, because the *flushing* is not about giving a page-sized
range, it's about giving *one* address in that page.

Using "address+1" is actually equally descriptive of what the range is
("flush the tlb entry that maps this *byte*"), but is less amenable to
the above kind of silly PMD_SIZE confusion. Because the exact same
thing that is true for a single page is true for a page table
operation too. There is just a *single* page table directory entry
cache, it's not like you have one page table directory entry cache for
each page.

So what matters for the non-leaf operations is not size. Not AT ALL.
It's a totally different operation, and you'd need not a different
size, but a different flag entirely - the same way we already have a
different flag for the "fullmm" case. And it's actually for exactly
the same reason as "fullmm": you do the flush itself differently,
it's not that the range is different. If it was just about the range,
then "fullmm" would just initialize the range to the whole VM. But it
*isn't* about the range. It's about the fact that a full-MM tear-down
can fundamentally do different operations, knowing that there are no
other threads using that VM any more.

So I really really object to playing games with PMD_SIZE or other
idiocies, because it fundamentally mis-states the whole problem.

If ARM64 wants to make the "leaf vs non-leaf" TLB operation, then you
need to add a new flag, and you just set that flag when you tear down
a page table (as opposed to just a PTE).

It doesn't affect the "range" at all in any way as far as I can tell.

> With RCU_TABLE_FREE, I think checking tlb->local.next would do the trick
> but for x86 we can keep mmu_gather.need_flush only for pte clearing
> and remove need_flush = 1 from p*_free_tlb() functions.

This is more confusion about what is going on.

I'd actually really really prefer to have the "need_flush = 1" for the
page table tear-down case even for x86. No, if you never removed any
PTE at all, it is possible that it's not actually needed because an
x86 CPU isn't supposed to cache non-present page table entries (so if
you could tear down the page tables because there were no entries,
there should be no TLB entries, and there *hopefully* should be no
caches of mid-level page tables either that need a TLB invalidate).
But in practice, I'd not take that chance. If you tear down a page
table, you should flush the TLB in that range (and note how I say *in*
that range - an invalidate anywhere in the range should be sufficient,
not "over the whole range"!), because quite frankly, from an
implementation standpoint, I really think it's the sane and safe thing
to do.

So I would suggest you think of the x86 invlpg instruction as your
"non-leaf invalidate". The same way you'd want to do non-leaf
invalidate whenever you tear down a page table, you'd do "invlpg" on
x86.

And no, we should *not* play games with "tlb->local.next". That just
sounds completely and utterly insane. That's a hack, it's unclear,
it's stupid, and it's connected to a totally irrelevant implementation
detail, namely that random RCU freeing.

Set a flag, for chrissake. Just say "when you free a pmd/pud/pgd, set
tlb->need_flush_inner to let the flusher know" (probably in *addition*
to "tlb->need_flush", just to maintain that rule). Make it explicit,
and make it obvious, and don't play games.
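
As a rough userspace sketch of that flag-based approach (the struct, enum
and helper names are hypothetical, only need_flush and need_flush_inner
come from the text above, and no such code exists in this patch):

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag pair; need_flush_inner is the name floated above. */
struct sketch_gather_flags {
	bool need_flush;
	bool need_flush_inner;
};

enum sketch_flush_kind {
	SKETCH_FLUSH_NONE,
	SKETCH_FLUSH_LEAF,	/* only PTEs were cleared */
	SKETCH_FLUSH_NON_LEAF,	/* a pmd/pud/pgd page was freed */
};

/* Clearing a PTE only needs the ordinary flush flag. */
static void sketch_pte_cleared(struct sketch_gather_flags *tlb)
{
	tlb->need_flush = true;
}

/* Freeing a page table sets both flags, keeping the need_flush rule. */
static void sketch_table_freed(struct sketch_gather_flags *tlb)
{
	tlb->need_flush = true;
	tlb->need_flush_inner = true;
}

/* The flusher picks the operation from the flags, not from range sizes. */
static enum sketch_flush_kind
sketch_pick_flush(const struct sketch_gather_flags *tlb)
{
	/* need_flush_inner is only ever set together with need_flush. */
	assert(tlb->need_flush || !tlb->need_flush_inner);
	if (!tlb->need_flush)
		return SKETCH_FLUSH_NONE;
	return tlb->need_flush_inner ? SKETCH_FLUSH_NON_LEAF
				     : SKETCH_FLUSH_LEAF;
}
```

The range (start/end) plays no part in the decision here; it would only
say where to flush, not which kind of flush to issue.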

And don't make it be about range sizes. Because I really don't see how
the exact end/start could ever be relevant. TLBs aren't byte-based,
they are entry-based.

                         Linus

* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-11-06 17:53               ` Linus Torvalds
@ 2014-11-06 18:38                 ` Catalin Marinas
  2014-11-06 21:29                   ` Linus Torvalds
  0 siblings, 1 reply; 27+ messages in thread
From: Catalin Marinas @ 2014-11-06 18:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Thu, Nov 06, 2014 at 05:53:58PM +0000, Linus Torvalds wrote:
> On Thu, Nov 6, 2014 at 5:57 AM, Catalin Marinas <catalin.marinas@arm.com> wrote:
> >
> > Anyway, even without special "leaf" operations, it would be useful to
> > make the distinction between unmap_vmas() and free_pgtables() with
> > regards to the ranges tracked by mmu_gather. For the former, tlb_flush()
> > needs to flush the range in PAGE_SIZE increments (assuming a mix of
> > small and huge pages). For the latter, PMD_SIZE increments would be
> > enough.
> 
> Why would you *ever* care about the increments?

Sorry, I wasn't clear enough about the "increments" part. I agreed with
not using end = start + PMD_SIZE/PAGE_SIZE from your previous email
already.

The flush_tlb_range() implementation on ARM(64) uses a loop that goes
over the given range in PAGE_SIZE increments. This is fine and even
optimal when we flush the PTEs. But it could be even faster when we go
over the same range again and only need to flush the page table cache
(PMD/PUD/PGD). A new flush_tlb_table_range() function could loop in
PMD_SIZE increments. That's an arm64-only implementation of the TLB
range flushing, I'm not suggesting the PAGE_SIZE/PMD_SIZE increments
when setting mmu_gather.end at all.
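
To make the difference concrete, here is a small userspace sketch that
only counts how many invalidate operations each stride would issue over
the same range. The constants assume 4K pages with 2M PMDs, and the
helper is hypothetical; it is not the arm64 flush_tlb_range() code:

```c
#include <assert.h>

/* Assumed geometry: 4K pages, 2M PMD entries. Illustration only. */
#define SKETCH_PAGE_SIZE (1UL << 12)
#define SKETCH_PMD_SIZE  (1UL << 21)

/*
 * Count the invalidations a loop over [start, end) would issue when it
 * advances by 'step' bytes per operation.
 */
static unsigned long sketch_count_invalidations(unsigned long start,
						unsigned long end,
						unsigned long step)
{
	unsigned long addr, count = 0;

	assert(step != 0);
	for (addr = start; addr < end; addr += step)
		count++;
	return count;
}
```

Under these assumptions, a range spanning four PMD entries takes 2048
operations at PAGE_SIZE stride but only 4 at PMD_SIZE stride.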

> Quite frankly, I think even the PAGE_SIZE thing is just (a) stupid and
> (b) misleading.
> 
> It might actually be better if, instead of
> 
>     tlb->end = max(tlb->end, address + PAGE_SIZE);
> 
> it were a simpler
> 
>     tlb->end = max(tlb->end, address+1);

I fully agree.

One minor drawback is that the TLB invalidation instructions on ARM work
on pfn and end >> PAGE_SHIFT would make it equal to start. It can be
fixed in the arch code though.
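
A userspace sketch of that drawback and one possible arch-side fix,
assuming 4K pages; the helper names are hypothetical and this is not the
actual ARM implementation:

```c
#include <assert.h>

/* Assumed 4K pages. Illustration only. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1UL << SKETCH_PAGE_SHIFT)

/* Pages covered if 'end' is simply truncated to a page frame number. */
static unsigned long sketch_pages_truncated(unsigned long start,
					    unsigned long end)
{
	assert(start <= end);
	return (end >> SKETCH_PAGE_SHIFT) - (start >> SKETCH_PAGE_SHIFT);
}

/* Pages covered if 'end' is first rounded up to a page boundary. */
static unsigned long sketch_pages_rounded(unsigned long start,
					  unsigned long end)
{
	unsigned long end_pfn;

	assert(start <= end);
	end_pfn = (end + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
	return end_pfn - (start >> SKETCH_PAGE_SHIFT);
}
```

With end = address + 1, the truncating version covers no pages at all
for a single-page range, while rounding up covers the page that contains
the address.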

> So what matters for the non-leaf operations is not size. Not AT ALL.
> It's a totally different operation, and you'd need not a different
> size, but a different flag entirely - the same way we already have a
> different flag for the "fullmm" case. And it's actually for exactly
> the same reason as "fullmm": you do the flush itself differently,
> it's not that the range is different. If it was just about the range,
> then "fullmm" would just initialize the range to the whole VM. But it
> *isn't* about the range. It's about the fact that a full-MM tear-down
> can fundamentally do different operations, knowing that there are no
> other threads using that VM any more.
> 
> So I really really object to playing games with PMD_SIZE or other
> idiocies, because it fundamentally mis-states the whole problem.

That's not what I was suggesting (though I agree I wasn't clear).

The use of PMD_SIZE steps in a new flush_tlb_table_range() loop is
entirely an arch-specific optimisation. The only problem is that the arch
code doesn't currently know which tlb_flush() it should use, because
need_flush is set for both PTEs and page table tear-down. We just need
different flags here to be able to optimise the arch code further.

> If ARM64 wants to make the "leaf vs non-leaf" TLB operation, then you
> need to add a new flag, and you just set that flag when you tear down
> a page table (as opposed to just a PTE).

Indeed. Actually we could use need_flush only for PTEs and ignore it for
page table tear-down. With Will's patch, we can already check tlb->end
for what need_flush is currently doing and use need_flush in an
arch-specific way (and we could give a new name as well).

> > With RCU_TABLE_FREE, I think checking tlb->local.next would do the trick
> > but for x86 we can keep mmu_gather.need_flush only for pte clearing
> > and remove need_flush = 1 from p*_free_tlb() functions.
> 
> This is more confusion about what is going on.

Yes, and if we do this we may no longer understand the code in a few
weeks' time.

> I'd actually really really prefer to have the "need_flush = 1" for the
> page table tear-down case even for x86. No, if you never removed any
> PTE at all, it is possible that it's not actually needed because an
> x86 CPU isn't supposed to cache non-present page table entries (so if
> you could tear down the page tables because there were no entries,
> there should be no TLB entries, and there *hopefully* should be no
> caches of mid-level page tables either that need a TLB invalidate).

On ARM, as long as an intermediate page table entry is valid, even
though the full translation (PTE) is not, the CPU can go and cache it.
What's worse (and we've hit it before) is that it may even end up
reading something that looks like a valid PTE (if the PTE page has been
freed before the TLB invalidation) and it will stick around as a full
translation. So we need to flush the page table cache before freeing the
page table pages.

> But in practice, I'd not take that chance. If you tear down a page
> table, you should flush the TLB in that range (and note how I say *in*
> that range - an invalidate anywhere in the range should be sufficient,
> not "over the whole range"!), because quite frankly, from an
> implementation standpoint, I really think it's the sane and safe thing
> to do.

A single TLB invalidation is indeed enough for a single page table page
removed. If we queue the freeing of multiple page table pages, we
accumulate the range via pmd_free_tlb() etc. and we would eventually need
multiple TLB invalidations, at most one every PMD_SIZE (that's when
!fullmm).

> So I would suggest you think of the x86 invlpg instruction as your
> "non-leaf invalidate". The same way you'd want to do non-leaf
> invalidate whenever you tear down a page table, you'd do "invlpg" on
> x86.

I need to dig some more into the x86 code, as I'm not familiar with it. We
could do the page table cache invalidation non-lazily every time
pmd_free_tlb() is called, though that's not optimal, as we need a heavy
DSB barrier on ARM after each TLB invalidate.

> And no, we should *not* play games with "tlb->local.next". That just
> sounds completely and utterly insane. That's a hack, it's unclear,
> it's stupid, and it's connected to a totally irrelevant implementation
> detail, namely that random RCU freeing.
> 
> Set a flag, for chrissake. Just say "when you free a pmd/pud/pgd, set
> tlb->need_flush_inner to let the flusher know" (probably in *addition*
> to "tlb->need_flush", just to maintain that rule). Make it explicit,
> and make it obvious, and don't play games.

I agree.

-- 
Catalin


* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-11-06 18:38                 ` Catalin Marinas
@ 2014-11-06 21:29                   ` Linus Torvalds
  2014-11-07 16:50                     ` Catalin Marinas
  0 siblings, 1 reply; 27+ messages in thread
From: Linus Torvalds @ 2014-11-06 21:29 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Thu, Nov 6, 2014 at 10:38 AM, Catalin Marinas
<catalin.marinas@arm.com> wrote:
> On Thu, Nov 06, 2014 at 05:53:58PM +0000, Linus Torvalds wrote:
>
> Sorry, I wasn't clear enough about the "increments" part. I agreed with
> not using end = start + PMD_SIZE/PAGE_SIZE from your previous email
> already.

Ahh, I misunderstood. You're really just after the granularity of tlb flushes.

That's fine. That makes sense. In fact, how about adding "granularity"
to the mmu_gather structure, and then doing:

 - in __tlb_reset_range(), setting it to ~0ul

 - add "granularity" to __tlb_adjust_range(), and make it do something like

       if (!tlb->fullmm) {
               tlb->granularity = min(tlb->granularity, granularity);
               tlb->start = min(tlb->start, address);
               tlb->end = max(tlb->end, address+1);
       }

and then the TLB flush logic would basically do

   address = tlb->start;
   do {
        flush(address);
        if (address + tlb->granularity < address)
                break;
        address = address + tlb->granularity;
   } while (address < tlb->end);

or something like that.
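
The proposal above can be tried out in userspace. What follows is only a
minimal sketch: struct gather_model, the model_* helpers, the 4K/2M sizes
and the ~0UL initial start value are assumptions made for illustration, not
kernel code. It counts how many per-address flushes the suggested loop
would issue, and treats an empty range (end == 0) as "nothing to flush".

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096UL                    /* assumed 4K pages */
#define MODEL_PMD_SIZE  (512UL * MODEL_PAGE_SIZE) /* assumed 2M huge pages */

/* Hypothetical stand-in for struct mmu_gather, reduced to the range fields. */
struct gather_model {
	unsigned long start;
	unsigned long end;
	unsigned long granularity;
	unsigned int fullmm;
};

static unsigned long min_ul(unsigned long a, unsigned long b) { return a < b ? a : b; }
static unsigned long max_ul(unsigned long a, unsigned long b) { return a > b ? a : b; }

/* Empty range, granularity at its maximum so the first min() wins. */
static void model_reset_range(struct gather_model *tlb)
{
	tlb->start = ~0UL;
	tlb->end = 0;
	tlb->granularity = ~0UL;
}

/* Grow the range and remember the smallest granularity seen so far. */
static void model_adjust_range(struct gather_model *tlb, unsigned long address,
			       unsigned long granularity)
{
	if (!tlb->fullmm) {
		tlb->granularity = min_ul(tlb->granularity, granularity);
		tlb->start = min_ul(tlb->start, address);
		tlb->end = max_ul(tlb->end, address + 1);
	}
}

/* Walk the range in granularity steps; return the number of flushes issued. */
static unsigned long model_flush(const struct gather_model *tlb)
{
	unsigned long address = tlb->start;
	unsigned long flushes = 0;

	if (!tlb->end)
		return 0;	/* nothing was batched */
	assert(tlb->granularity != 0);
	do {
		flushes++;	/* stands in for flush(address) */
		if (address + tlb->granularity < address)
			break;	/* address would wrap around */
		address += tlb->granularity;
	} while (address < tlb->end);
	return flushes;
}

/* Unmap n_huge PMD-sized mappings followed by n_small pages, then flush. */
static unsigned long model_unmap_and_flush(unsigned long base,
					   unsigned long n_huge,
					   unsigned long n_small)
{
	struct gather_model tlb = { 0 };
	unsigned long addr = base, i;

	model_reset_range(&tlb);
	for (i = 0; i < n_huge; i++, addr += MODEL_PMD_SIZE)
		model_adjust_range(&tlb, addr, MODEL_PMD_SIZE);
	for (i = 0; i < n_small; i++, addr += MODEL_PAGE_SIZE)
		model_adjust_range(&tlb, addr, MODEL_PAGE_SIZE);
	return model_flush(&tlb);
}
```

In this model, four huge pages need four flushes, while four huge pages
followed by a single normal page fall back to page granularity across the
whole range, which is the mixed case discussed next.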

Now, if you unmap mixed ranges of large-pages and regular pages, you'd
still have that granularity of one page, but quite frankly, if you do
that, you probably deserve it. The common case is almost certainly
going to be just "unmap large pages" or "unmap normal pages".

And if it turns out that I'm completely wrong, and mixed granularities
are common, maybe there could be some hack in the "tlb->granularity"
calculations that just forces a TLB flush when the granularity
changes.

Hmm?

                 Linus


* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-11-06 21:29                   ` Linus Torvalds
@ 2014-11-07 16:50                     ` Catalin Marinas
  2014-11-10 13:56                       ` Will Deacon
  0 siblings, 1 reply; 27+ messages in thread
From: Catalin Marinas @ 2014-11-07 16:50 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Will Deacon, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Thu, Nov 06, 2014 at 09:29:54PM +0000, Linus Torvalds wrote:
> On Thu, Nov 6, 2014 at 10:38 AM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > On Thu, Nov 06, 2014 at 05:53:58PM +0000, Linus Torvalds wrote:
> >
> > Sorry, I wasn't clear enough about the "increments" part. I agreed with
> > not using end = start + PMD_SIZE/PAGE_SIZE from your previous email
> > already.
> 
> Ahh, I misunderstood. You're really just after the granularity of tlb flushes.

Yes. The granularity would also help when tearing down page tables as
the granule would be PMD_SIZE.

> That's fine. That makes sense. In fact, how about adding "granularity"
> to the mmu_gather structure, and then doing:
> 
>  - in __tlb_reset_range(), setting it to ~0ul
> 
>  - add "granularity" to __tlb_adjust_range(), and make it do something like
> 
>        if (!tlb->fullmm) {
>                tlb->granularity = min(tlb->granularity, granularity);
>                tlb->start = min(tlb->start, address);
>                tlb->end = max(tlb->end, address+1);
>        }
> 
> and then the TLB flush logic would basically do
> 
>    address = tlb->start;
>    do {
>         flush(address);
>         if (address + tlb->granularity < address)
>                 break;
>         address = address + tlb->granularity;
>    } while (address < tlb->end);
> 
> or something like that.

Indeed. We'll come up with a patch after Will's clean-up.

> Now, if you unmap mixed ranges of large-pages and regular pages, you'd
> still have that granularity of one page, but quite frankly, if you do
> that, you probably deserve it. The common case is almost certainly
> going to be just "unmap large pages" or "unmap normal pages".

I think this could only happen with transparent huge pages that replaced
small pages in an anonymous mapping. I don't think munmap'ing them
happens very often.

Thanks.

-- 
Catalin


* Re: [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure
  2014-11-07 16:50                     ` Catalin Marinas
@ 2014-11-10 13:56                       ` Will Deacon
  0 siblings, 0 replies; 27+ messages in thread
From: Will Deacon @ 2014-11-10 13:56 UTC (permalink / raw)
  To: Catalin Marinas
  Cc: Linus Torvalds, Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Fri, Nov 07, 2014 at 04:50:04PM +0000, Catalin Marinas wrote:
> On Thu, Nov 06, 2014 at 09:29:54PM +0000, Linus Torvalds wrote:
> > That's fine. That makes sense. In fact, how about adding "granularity"
> > to the mmu_gather structure, and then doing:
> > 
> >  - in __tlb_reset_range(), setting it to ~0ul
> > 
> >  - add "granularity" to __tlb_adjust_range(), and make it do something like
> > 
> >        if (!tlb->fullmm) {
> >                tlb->granularity = min(tlb->granularity, granularity);
> >                tlb->start = min(tlb->start, address);
> >                tlb->end = max(tlb->end, address+1);
> >        }
> > 
> > and then the TLB flush logic would basically do
> > 
> >    address = tlb->start;
> >    do {
> >         flush(address);
> >         if (address + tlb->granularity < address)
> >                 break;
> >         address = address + tlb->granularity;
> >    } while (address < tlb->end);
> > 
> > or something like that.
> 
> Indeed. We'll come up with a patch after Will's clean-up.

My clean-up is the patch I sent previously, plus the removal of need_flush.

Incremental diff for the latter part below. We drop the setting of
need_flush in tlb_remove_table, but I can't figure out why it was there in
the first place (need_flush was already set by pXd_free_tlb).
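
The idea of using the range itself in place of need_flush can be modelled
in userspace. This is only a sketch under stated assumptions: struct
gather_model, the model_* helpers, MODEL_TASK_SIZE and the page size are
invented for illustration, and the reset after each flush is assumed to
match what tlb_flush_mmu_tlbonly does in the kernel.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE 4096UL		/* assumed 4K pages */
#define MODEL_TASK_SIZE (1UL << 30)	/* assumed upper bound for addresses */

/* Hypothetical stand-in for struct mmu_gather without a need_flush bit. */
struct gather_model {
	unsigned long start;
	unsigned long end;
	unsigned long flushes;	/* number of real TLB flushes issued */
};

/* An empty range is encoded as end == 0. */
static void model_reset_range(struct gather_model *tlb)
{
	tlb->start = MODEL_TASK_SIZE;
	tlb->end = 0;
}

/* Recording an unmapped page makes end non-zero, which replaces need_flush. */
static void model_adjust_range(struct gather_model *tlb, unsigned long address)
{
	if (address < tlb->start)
		tlb->start = address;
	if (address + MODEL_PAGE_SIZE > tlb->end)
		tlb->end = address + MODEL_PAGE_SIZE;
}

/* Flush only when something was batched, then start a fresh range. */
static void model_flush_tlbonly(struct gather_model *tlb)
{
	if (!tlb->end)
		return;
	assert(tlb->start < tlb->end);
	tlb->flushes++;		/* stands in for tlb_flush(tlb) */
	model_reset_range(tlb);
}

/* Unmap n_present pages, call the flush n_calls times, count real flushes. */
static unsigned long model_count_flushes(unsigned long n_present,
					 unsigned long n_calls)
{
	struct gather_model tlb = { 0 };
	unsigned long i;

	model_reset_range(&tlb);
	for (i = 0; i < n_present; i++)
		model_adjust_range(&tlb, 0x10000UL + i * MODEL_PAGE_SIZE);
	for (i = 0; i < n_calls; i++)
		model_flush_tlbonly(&tlb);
	return tlb.flushes;
}
```

In this model, unmapping nothing never flushes however often the flush is
requested, and repeated requests after a single batch flush only once.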

Will

--->8

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index a9c9df0f60ff..c028fe37456f 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -39,7 +39,7 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 {
 	if (tlb->fullmm) {
 		flush_tlb_mm(tlb->mm);
-	} else if (tlb->end > 0) {
+	} else {
 		struct vm_area_struct vma = { .vm_mm = tlb->mm, };
 		flush_tlb_range(&vma, tlb->start, tlb->end);
 	}
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 340bc5c5ca2d..08848050922e 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -96,10 +96,9 @@ struct mmu_gather {
 #endif
 	unsigned long		start;
 	unsigned long		end;
-	unsigned int		need_flush : 1,	/* Did free PTEs */
 	/* we are in the middle of an operation to clear
 	 * a full mm and can make some optimizations */
-				fullmm : 1,
+	unsigned int		fullmm : 1,
 	/* we have performed an operation which
 	 * requires a complete flush of the tlb */
 				need_flush_all : 1;
@@ -131,10 +130,8 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 static inline void __tlb_adjust_range(struct mmu_gather *tlb,
 				      unsigned long address)
 {
-	if (!tlb->fullmm) {
-		tlb->start = min(tlb->start, address);
-		tlb->end = max(tlb->end, address + PAGE_SIZE);
-	}
+	tlb->start = min(tlb->start, address);
+	tlb->end = max(tlb->end, address + PAGE_SIZE);
 }
 
 static inline void __tlb_reset_range(struct mmu_gather *tlb)
@@ -154,7 +151,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 
 #define __tlb_end_vma(tlb, vma)					\
 	do {							\
-		if (!tlb->fullmm) {				\
+		if (!tlb->fullmm && tlb->end) {			\
 			tlb_flush(tlb);				\
 			__tlb_reset_range(tlb);			\
 		}						\
@@ -171,13 +168,12 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 /**
  * tlb_remove_tlb_entry - remember a pte unmapping for later tlb invalidation.
  *
- * Record the fact that pte's were really umapped in ->need_flush, so we can
- * later optimise away the tlb invalidate.   This helps when userspace is
- * unmapping already-unmapped pages, which happens quite a lot.
+ * Record the fact that pte's were really unmapped by updating the range,
+ * so we can later optimise away the tlb invalidate.   This helps when
+ * userspace is unmapping already-unmapped pages, which happens quite a lot.
  */
 #define tlb_remove_tlb_entry(tlb, ptep, address)		\
 	do {							\
-		tlb->need_flush = 1;				\
 		__tlb_adjust_range(tlb, address);		\
 		__tlb_remove_tlb_entry(tlb, ptep, address);	\
 	} while (0)
@@ -192,14 +188,12 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 
 #define tlb_remove_pmd_tlb_entry(tlb, pmdp, address)		\
 	do {							\
-		tlb->need_flush = 1;				\
 		__tlb_adjust_range(tlb, address);		\
 		__tlb_remove_pmd_tlb_entry(tlb, pmdp, address);	\
 	} while (0)
 
 #define pte_free_tlb(tlb, ptep, address)			\
 	do {							\
-		tlb->need_flush = 1;				\
 		__tlb_adjust_range(tlb, address);		\
 		__pte_free_tlb(tlb, ptep, address);		\
 	} while (0)
@@ -207,7 +201,6 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 #ifndef __ARCH_HAS_4LEVEL_HACK
 #define pud_free_tlb(tlb, pudp, address)			\
 	do {							\
-		tlb->need_flush = 1;				\
 		__tlb_adjust_range(tlb, address);		\
 		__pud_free_tlb(tlb, pudp, address);		\
 	} while (0)
@@ -215,7 +208,6 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
 
 #define pmd_free_tlb(tlb, pmdp, address)			\
 	do {							\
-		tlb->need_flush = 1;				\
 		__tlb_adjust_range(tlb, address);		\
 		__pmd_free_tlb(tlb, pmdp, address);		\
 	} while (0)
diff --git a/mm/memory.c b/mm/memory.c
index 0bc940e41ec9..8b1c1d2e7c67 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -220,7 +220,6 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 	/* Is it from 0 to ~0? */
 	tlb->fullmm     = !(start | (end+1));
 	tlb->need_flush_all = 0;
-	tlb->need_flush = 0;
 	tlb->local.next = NULL;
 	tlb->local.nr   = 0;
 	tlb->local.max  = ARRAY_SIZE(tlb->__pages);
@@ -236,7 +235,9 @@ void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm, unsigned long
 
 static void tlb_flush_mmu_tlbonly(struct mmu_gather *tlb)
 {
-	tlb->need_flush = 0;
+	if (!tlb->end)
+		return;
+
 	tlb_flush(tlb);
 #ifdef CONFIG_HAVE_RCU_TABLE_FREE
 	tlb_table_flush(tlb);
@@ -257,8 +258,6 @@ static void tlb_flush_mmu_free(struct mmu_gather *tlb)
 
 void tlb_flush_mmu(struct mmu_gather *tlb)
 {
-	if (!tlb->need_flush)
-		return;
 	tlb_flush_mmu_tlbonly(tlb);
 	tlb_flush_mmu_free(tlb);
 }
@@ -293,7 +292,7 @@ int __tlb_remove_page(struct mmu_gather *tlb, struct page *page)
 {
 	struct mmu_gather_batch *batch;
 
-	VM_BUG_ON(!tlb->need_flush);
+	VM_BUG_ON(!tlb->end);
 
 	batch = tlb->active;
 	batch->pages[batch->nr++] = page;
@@ -360,8 +359,6 @@ void tlb_remove_table(struct mmu_gather *tlb, void *table)
 {
 	struct mmu_table_batch **batch = &tlb->batch;
 
-	tlb->need_flush = 1;
-
 	/*
 	 * When there's less then two users of this mm there cannot be a
 	 * concurrent page-table walk.


* [RFC PATCH 2/2] zap_pte_range: fix partial TLB flushing in response to a dirty pte
  2014-10-28 11:44 [RFC PATCH 0/2] Fix a couple of issues with zap_pte_range and MMU gather Will Deacon
  2014-10-28 11:44 ` [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure Will Deacon
@ 2014-10-28 11:44 ` Will Deacon
  2014-10-28 15:18   ` Linus Torvalds
  1 sibling, 1 reply; 27+ messages in thread
From: Will Deacon @ 2014-10-28 11:44 UTC (permalink / raw)
  To: torvalds, peterz; +Cc: linux-kernel, linux, benh, Will Deacon

When we encounter a dirty page during unmap, we force a TLB invalidation
to avoid a race with pte_mkclean and stale, dirty TLB entries in the
CPU.

This uses the same force_flush logic as the batch failure code, but
since we don't break out of the loop when finding a dirty pte, tlb->end
can be < addr as we only batch for present ptes. This can result in a
negative range being passed to subsequent TLB invalidation calls,
potentially leading to massive over-invalidation of the TLB (observed
in practice running firefox on arm64).

This patch fixes the issue by restricting the use of addr in the TLB
range calculations. The first range then ends up covering tlb->start to
min(tlb->end, addr), which corresponds to the currently batched range.
The second range then covers anything remaining, which may still lead to
a (much reduced) over-invalidation of the TLB.

Signed-off-by: Will Deacon <will.deacon@arm.com>
---
 mm/memory.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 3e503831e042..ea41508d41f3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1194,11 +1194,10 @@ again:
 		 * then update the range to be the remaining
 		 * TLB range.
 		 */
-		old_end = tlb->end;
-		tlb->end = addr;
+		tlb->end = old_end = min(tlb->end, addr);
 		tlb_flush_mmu_tlbonly(tlb);
-		tlb->start = addr;
-		tlb->end = old_end;
+		tlb->start = old_end;
+		tlb->end = end;
 	}
 	pte_unmap_unlock(start_pte, ptl);

-- 
2.1.1


* Re: [RFC PATCH 2/2] zap_pte_range: fix partial TLB flushing in response to a dirty pte
  2014-10-28 11:44 ` [RFC PATCH 2/2] zap_pte_range: fix partial TLB flushing in response to a dirty pte Will Deacon
@ 2014-10-28 15:18   ` Linus Torvalds
  0 siblings, 0 replies; 27+ messages in thread
From: Linus Torvalds @ 2014-10-28 15:18 UTC (permalink / raw)
  To: Will Deacon
  Cc: Peter Zijlstra, Linux Kernel Mailing List,
	Russell King - ARM Linux, Benjamin Herrenschmidt

On Tue, Oct 28, 2014 at 4:44 AM, Will Deacon <will.deacon@arm.com> wrote:
> @@ -1194,11 +1194,10 @@ again:
>                  * then update the range to be the remaining
>                  * TLB range.
>                  */
> -               old_end = tlb->end;
> -               tlb->end = addr;
> +               tlb->end = old_end = min(tlb->end, addr);
>                 tlb_flush_mmu_tlbonly(tlb);
> -               tlb->start = addr;
> -               tlb->end = old_end;
> +               tlb->start = old_end;
> +               tlb->end = end;

I don't think this is right. Setting "tlb->end = end" looks very wrong
indeed, because "end" here inside zap_pte_range() is *not* the final
end of the zap range, it is just the end of the current set of pte's.

There's a reason the old code *saved* the old end value. You've now
ripped that out, and use the "old_end" for something else entirely.

Your arm64 version of tlb_add_flush() then hides the bug you just
introduced by updating the end range for each page you encounter. But
quite frankly, I think your problems are all fundamental to that very
issue. You're playing games with start/end during the TLB flush
itself, which is not how those things were designed to work.

So now you break everything that *doesn't* do your arm games.
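
Both the original problem and the objection can be reproduced with a small
userspace model of the force_flush range update. This is a sketch only:
struct range_model and the model_* helpers are hypothetical names, and the
addresses in the note below are made up. Each helper returns the range
left in the gather structure after the first flush.

```c
#include <assert.h>

/* Hypothetical stand-in for the start/end fields of struct mmu_gather. */
struct range_model {
	unsigned long start;
	unsigned long end;
};

/* Size of a range in unsigned arithmetic; wraps if end < start. */
static unsigned long model_span(struct range_model r)
{
	return r.end - r.start;
}

/*
 * Pre-patch logic: flush [start, addr), then continue with
 * [addr, saved end).
 */
static struct range_model model_old_remaining(unsigned long start,
					      unsigned long tlb_end,
					      unsigned long addr)
{
	struct range_model r;

	assert(start <= tlb_end);
	r.start = addr;
	r.end = tlb_end;	/* the saved old_end */
	return r;
}

/*
 * Patched logic: flush [start, min(tlb_end, addr)), then continue from
 * there up to "end", the end of the current set of ptes.
 */
static struct range_model model_new_remaining(unsigned long start,
					      unsigned long tlb_end,
					      unsigned long addr,
					      unsigned long end)
{
	struct range_model r;
	unsigned long old_end = tlb_end < addr ? tlb_end : addr;

	assert(start <= tlb_end);
	r.start = old_end;
	r.end = end;		/* no longer the saved end */
	return r;
}
```

With an arch-updated range of [0x1000, 0x2000) and addr = 0x5000, the old
logic leaves start = 0x5000 and end = 0x2000, a negative range whose
unsigned span wraps to a huge value. With a range that covers the whole
zap, say [0x1000, 0x400000), and a pte range ending at 0x200000, the
patched logic leaves end = 0x200000, so the original end is lost, which is
the breakage described above for code that does not update end per page.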

                   Linus


end of thread, other threads:[~2014-11-10 13:56 UTC | newest]

Thread overview: 27+ messages
2014-10-28 11:44 [RFC PATCH 0/2] Fix a couple of issues with zap_pte_range and MMU gather Will Deacon
2014-10-28 11:44 ` [RFC PATCH 1/2] zap_pte_range: update addr when forcing flush after TLB batching faiure Will Deacon
2014-10-28 15:30   ` Linus Torvalds
2014-10-28 16:07     ` Will Deacon
2014-10-28 16:25       ` Linus Torvalds
2014-10-28 17:07         ` Will Deacon
2014-10-28 18:03           ` Linus Torvalds
2014-10-28 21:16         ` Benjamin Herrenschmidt
2014-10-28 21:32           ` Linus Torvalds
2014-10-28 21:40     ` Linus Torvalds
2014-10-29 19:47       ` Will Deacon
2014-10-29 21:11         ` Linus Torvalds
2014-10-29 21:27           ` Benjamin Herrenschmidt
2014-11-01 17:01             ` Linus Torvalds
2014-11-01 20:25               ` Benjamin Herrenschmidt
2014-11-03 17:56               ` Will Deacon
2014-11-03 18:05                 ` Linus Torvalds
2014-11-04 14:29         ` Catalin Marinas
2014-11-04 16:08           ` Linus Torvalds
2014-11-06 13:57             ` Catalin Marinas
2014-11-06 17:53               ` Linus Torvalds
2014-11-06 18:38                 ` Catalin Marinas
2014-11-06 21:29                   ` Linus Torvalds
2014-11-07 16:50                     ` Catalin Marinas
2014-11-10 13:56                       ` Will Deacon
2014-10-28 11:44 ` [RFC PATCH 2/2] zap_pte_range: fix partial TLB flushing in response to a dirty pte Will Deacon
2014-10-28 15:18   ` Linus Torvalds
