From: Benjamin Herrenschmidt <benh@au1.ibm.com>
To: Hugh Dickins <hughd@google.com>
Cc: linuxppc-dev@lists.ozlabs.org,
Anton Blanchard <anton@au1.ibm.com>,
Paul Mackerras <paulus@samba.org>,
"Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>,
David Gibson <david@gibson.dropbear.id.au>
Subject: Re: 3.10-rc ppc64 corrupts usermem when swapping
Date: Thu, 30 May 2013 17:00:36 +1000 [thread overview]
Message-ID: <1369897236.3928.93.camel@pasglop> (raw)
In-Reply-To: <alpine.LNX.2.00.1305292148550.9560@eggly.anvils>
On Wed, 2013-05-29 at 22:47 -0700, Hugh Dickins wrote:
> Running my favourite swapping load (repeated make -j20 kernel builds
> in tmpfs in parallel with repeated make -j20 kernel builds in ext4 on
> loop on tmpfs file, all limited by mem=700M and swap 1.5G) on 3.10-rc
> on PowerMac G5, the test dies with corrupted usermem after a few hours.
>
> Variously, segmentation fault or Binutils assertion fail or gcc Internal
> error in either or both builds: usually signs of swapping or TLB flushing
> gone wrong. Sometimes the tmpfs build breaks first, sometimes the ext4 on
> loop on tmpfs, so at least it looks unrelated to loop. No problem on x86.
>
> This is 64-bit kernel but 4k pages and old SuSE 11.1 32-bit userspace.
>
> I've just finished a manual bisection on arch/powerpc/mm (which might
> have been a wrong guess, but has paid off): the first bad commit is
> 7e74c3921ad9610c0b49f28b8fc69f7480505841
> "powerpc: Fix hpte_decode to use the correct decoding for page sizes".
Ok, I have other reasons to think is wrong. I debugged a case last week
where after kexec we still had stale TLB entries, due to the TLB cleanup
not working.
Thanks for doing that bisection ! I'll investigate ASAP (though it will
probably have to wait for tomorrow unless Paul beats me to it)
> I don't know if it's actually swapping to swap that's triggering the
> problem, or a more general page reclaim or TLB flush problem. I hit
> it originally when trying to test Mel Gorman's pagevec series on top
> of 3.10-rc; and though I then reproduced it without that series, it
> did seem to take much longer: so I have been applying Mel's series to
> speed up each step of the bisection. But if I went back again, might
> find it was just chance that I hit it sooner with Mel's series than
> without. So, you're probably safe to ignore that detail, but I
> mention it just in case it turns out to have some relevance.
>
> Something else peculiar that I've been doing in these runs, may or may
> not be relevant: I've been running swapon and swapoff repeatedly in the
> background, so that we're doing swapoff even while busy building.
>
> I probably can't go into much more detail on the test (it's hard
> to get the balance right, to be swapping rather than OOMing or just
> running without reclaim), but can test any patches you'd like me to
> try (though it may take 24 hours for me to report back usefully).
I think it's just failing to invalidate the TLB properly. At least one
bug I can spot just looking at it:
static void native_hpte_invalidate(unsigned long slot, unsigned long vpn,
int psize, int ssize, int local)
.../...
native_lock_hpte(hptep);
hpte_v = hptep->v;
actual_psize = hpte_actual_psize(hptep, psize);
if (actual_psize < 0) {
native_unlock_hpte(hptep);
local_irq_restore(flags);
return;
}
That's wrong. We must still perform the TLB invalidation even if the
hash PTE is empty.
In fact, Aneesh, this is a problem with MPSS for your THP work, I just
thought about it.
The reason is that if a hash bucket gets full, we "evict" a more/less
random entry from it. When we do that we don't invalidate the TLB
(hpte_remove) because we assume the old translation is still technically
"valid".
However that means that an hpte_invalidate *must* invalidate the TLB
later on even if it's not hitting the right entry in the hash.
However, I can see why that cannot work with THP/MPSS since you have no
way to know the page size from the PTE anymore....
So my question is, apart from hpte_decode used by kexec, which I will
fix by just blowing the whole TLB when not running phyp, why do you need
the "actual" size in invalidate and updatepp ? You really can't rely on
the size passed by the upper layers ?
Cheers,
Ben.
next prev parent reply other threads:[~2013-05-30 7:00 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-05-30 5:47 3.10-rc ppc64 corrupts usermem when swapping Hugh Dickins
2013-05-30 7:00 ` Benjamin Herrenschmidt [this message]
2013-05-30 8:27 ` Aneesh Kumar K.V
2013-05-30 8:33 ` Benjamin Herrenschmidt
2013-05-30 13:48 ` Hugh Dickins
2013-05-31 5:05 ` Hugh Dickins
2013-05-31 5:31 ` Benjamin Herrenschmidt
2013-05-31 14:45 ` Hugh Dickins
[not found] ` <87ppw7mrx7.fsf@linux.vnet.ibm.com>
2013-05-31 22:23 ` Benjamin Herrenschmidt
2013-06-02 7:22 ` Aneesh Kumar K.V
2013-06-02 18:19 ` Hugh Dickins
2013-05-30 16:10 ` Aneesh Kumar K.V
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1369897236.3928.93.camel@pasglop \
--to=benh@au1.ibm.com \
--cc=aneesh.kumar@linux.vnet.ibm.com \
--cc=anton@au1.ibm.com \
--cc=david@gibson.dropbear.id.au \
--cc=hughd@google.com \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=paulus@samba.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.