From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: linuxppc-dev list <linuxppc-dev@ozlabs.org>,
Paul Mackerras <paulus@samba.org>,
Anton Blanchard <anton@samba.org>,
Linux Memory Management <linux-mm@kvack.org>
Subject: Re: Possible ppc64 (and maybe others ?) mm problem
Date: Tue, 27 Feb 2007 21:43:39 +1100 [thread overview]
Message-ID: <45E40B5B.4040909@yahoo.com.au> (raw)
In-Reply-To: <1172569714.11949.73.camel@localhost.localdomain>
Benjamin Herrenschmidt wrote:
> We have a lingering problem that I stumbled upon the other day just
> before leaving for a rather long trip. I have a few minutes now and feel
> like writing it all down before I forget :-)
>
> So the main issue is that the ppc64 mmu hash table must not ever have a
> duplicate entry. That is, there must never be two entries in a hash
> group that can match the same virtual address. Ever.
>
> I don't know wether other archs with things like software loaded TLBs
> can have a similar problems ending up in trying to load two TLB entries
> for the same address and what the consequences can be.
>
> Thus it's very important when invalidating mappings, to always make sure
> we cannot fault in a new entry before we have cleared any possible
> previous entry from the hash table on powerpc (and possibly by
> extension, from the TLB on some sw loaded platforms).
>
> The powerpc kernel tracks the fact that a hash table entry may be
> present for a given linux PTE via a bit in the PTE (_PAGE_HASHPTE)
> along, on 64 bits, with some bits indicating which slot is used in a
> given "group" so we don't have to perform a search when invalidating.
>
> Now there is a race that I'm pretty sure we might hit, though I don't
> know if it's always been there or only got there due to the recent
> locking changes arund the vm, but basically, the problem is when we
> batch invalidations.
>
> When doing things like pte_clear, which are part of a batch, we
> atomically replace the PTE with a non-present one, and store the old one
> in the batch for further hash invalidations.
>
> That means that we must -not- allow a new PTE to be re-faulted in for
> that same page and thus potentially re-hashed in before we actually
> flush the hash table (which we do when "completing" the hash, with
> flush_tlb_*() called from tlb_finish_mmu() among others.
>
> The possible scenario I found out however was when looking at this like
> unmap_mapping_range(). It looks like this can call zap_page_range() and
> thus do batched invalidations, without taking any useful locks
> preventing new PTEs to be faulted in on the same range before we
> invalidate the batch.
>
> This can happen more specifically if the previously hashed PTE had
> non-full permissions (for example, is read only). In this case, we would
> hit do_page_fault() which wouldn't see any pte_present() and would
> basically fault a new one in despite one being already present in the
> hash table.
>
> I think we used to be safe thanks to the PTL, but not anymore. We
> sort-of assumed that insertions vs. removal races of that sort would
> never happen because we would always either be protected by the mmap_sem
> or the ptl while doing a batch.
I think you're right.
> The "quick fix" I can see would be for us to have a way to flush a
> pending batch in zap_pte_range(), before we unlock the PTE page (that is
> before pte_unmap_unlock()). That would prevent batches from spawning
> accross PTE page locks (whatever the granularity of that lock is).
That seems like it would work. The granularity is a pmd page worth of
ptes (PTRS_PER_PTE). That looks quite a bit larger than your flush batch
size, so it hopefully won't hurt too much.
> I suppose the above can be acheived by "hijacking" the
> arch_leave_lazy_mmu_mode() hook that was added for paravirt ops and make
> it flush any pending batch on powerpc, though I haven't had time to grep
> around other call sites to see if that could be a performance issue in
> other areas.
So long as SPLIT_PTLOCK_CPUS is a generic and not a per-arch config option,
it seems safer to tlb_finish_mmu unconditionally, before dropping the PTL.
Then, some new tlb flush hooks could be added for ptl-less-safe flush
designs to convert over to. No?
Alternatively, we could do it the other way around and make it a per-arch
option, defaulting to off, and with a new set of tlb hooks for those that
want to keep the batch inside the PTL.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
WARNING: multiple messages have this Message-ID (diff)
From: Nick Piggin <nickpiggin@yahoo.com.au>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>,
Anton Blanchard <anton@samba.org>,
linuxppc-dev list <linuxppc-dev@ozlabs.org>,
Linux Memory Management <linux-mm@kvack.org>
Subject: Re: Possible ppc64 (and maybe others ?) mm problem
Date: Tue, 27 Feb 2007 21:43:39 +1100 [thread overview]
Message-ID: <45E40B5B.4040909@yahoo.com.au> (raw)
In-Reply-To: <1172569714.11949.73.camel@localhost.localdomain>
Benjamin Herrenschmidt wrote:
> We have a lingering problem that I stumbled upon the other day just
> before leaving for a rather long trip. I have a few minutes now and feel
> like writing it all down before I forget :-)
>
> So the main issue is that the ppc64 mmu hash table must not ever have a
> duplicate entry. That is, there must never be two entries in a hash
> group that can match the same virtual address. Ever.
>
> I don't know wether other archs with things like software loaded TLBs
> can have a similar problems ending up in trying to load two TLB entries
> for the same address and what the consequences can be.
>
> Thus it's very important when invalidating mappings, to always make sure
> we cannot fault in a new entry before we have cleared any possible
> previous entry from the hash table on powerpc (and possibly by
> extension, from the TLB on some sw loaded platforms).
>
> The powerpc kernel tracks the fact that a hash table entry may be
> present for a given linux PTE via a bit in the PTE (_PAGE_HASHPTE)
> along, on 64 bits, with some bits indicating which slot is used in a
> given "group" so we don't have to perform a search when invalidating.
>
> Now there is a race that I'm pretty sure we might hit, though I don't
> know if it's always been there or only got there due to the recent
> locking changes arund the vm, but basically, the problem is when we
> batch invalidations.
>
> When doing things like pte_clear, which are part of a batch, we
> atomically replace the PTE with a non-present one, and store the old one
> in the batch for further hash invalidations.
>
> That means that we must -not- allow a new PTE to be re-faulted in for
> that same page and thus potentially re-hashed in before we actually
> flush the hash table (which we do when "completing" the hash, with
> flush_tlb_*() called from tlb_finish_mmu() among others.
>
> The possible scenario I found out however was when looking at this like
> unmap_mapping_range(). It looks like this can call zap_page_range() and
> thus do batched invalidations, without taking any useful locks
> preventing new PTEs to be faulted in on the same range before we
> invalidate the batch.
>
> This can happen more specifically if the previously hashed PTE had
> non-full permissions (for example, is read only). In this case, we would
> hit do_page_fault() which wouldn't see any pte_present() and would
> basically fault a new one in despite one being already present in the
> hash table.
>
> I think we used to be safe thanks to the PTL, but not anymore. We
> sort-of assumed that insertions vs. removal races of that sort would
> never happen because we would always either be protected by the mmap_sem
> or the ptl while doing a batch.
I think you're right.
> The "quick fix" I can see would be for us to have a way to flush a
> pending batch in zap_pte_range(), before we unlock the PTE page (that is
> before pte_unmap_unlock()). That would prevent batches from spawning
> accross PTE page locks (whatever the granularity of that lock is).
That seems like it would work. The granularity is a pmd page worth of
ptes (PTRS_PER_PTE). That looks quite a bit larger than your flush batch
size, so it hopefully won't hurt too much.
> I suppose the above can be acheived by "hijacking" the
> arch_leave_lazy_mmu_mode() hook that was added for paravirt ops and make
> it flush any pending batch on powerpc, though I haven't had time to grep
> around other call sites to see if that could be a performance issue in
> other areas.
So long as SPLIT_PTLOCK_CPUS is a generic and not a per-arch config option,
it seems safer to tlb_finish_mmu unconditionally, before dropping the PTL.
Then, some new tlb flush hooks could be added for ptl-less-safe flush
designs to convert over to. No?
Alternatively, we could do it the other way around and make it a per-arch
option, defaulting to off, and with a new set of tlb hooks for those that
want to keep the batch inside the PTL.
--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
next prev parent reply other threads:[~2007-02-27 10:50 UTC|newest]
Thread overview: 4+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-02-27 9:48 Possible ppc64 (and maybe others ?) mm problem Benjamin Herrenschmidt
2007-02-27 9:48 ` Benjamin Herrenschmidt
2007-02-27 10:43 ` Nick Piggin [this message]
2007-02-27 10:43 ` Nick Piggin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=45E40B5B.4040909@yahoo.com.au \
--to=nickpiggin@yahoo.com.au \
--cc=anton@samba.org \
--cc=benh@kernel.crashing.org \
--cc=linux-mm@kvack.org \
--cc=linuxppc-dev@ozlabs.org \
--cc=paulus@samba.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.