From: David Vrabel <david.vrabel@citrix.com>
To: Elena Ufimtseva <ufimtseva@gmail.com>
Cc: Steven Noonan <steven@uplinklabs.net>,
Daniel Borkmann <borkmann@iogearbox.net>,
Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>,
Boris Ostrovsky <boris.ostrovsky@oracle.com>,
xen-devel <xen-devel@lists.xenproject.org>,
George Dunlap <george.dunlap@eu.citrix.com>,
Dario Faggioli <dario.faggioli@citrix.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
Andrea Arcangeli <aarcange@redhat.com>,
"Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>,
Linux Kernel mailing List <linux-kernel@vger.kernel.org>,
Mel Gorman <mgorman@suse.de>, Rik van Riel <riel@redhat.com>,
Alex Thorlton <athorlton@sgi.com>,
Andrew Morton <akpm@linux-foundation.org>,
Vlastimil Babka <vbabka@suse.cz>,
Michel Lespinasse <walken@google.com>
Subject: Re: [BISECTED] Linux 3.12.7 introduces page map handling regression
Date: Fri, 24 Jan 2014 11:05:20 +0000
Message-ID: <52E248F0.1060708@citrix.com>
In-Reply-To: <CAEr7rXjge6rKzxbwy+0A6-5YhVZL9WGmaLrDYbE8H5hrtwq_4A@mail.gmail.com>
On 23/01/14 16:23, Elena Ufimtseva wrote:
> On Wed, Jan 22, 2014 at 3:33 PM, Steven Noonan <steven@uplinklabs.net> wrote:
>> On Wed, Jan 22, 2014 at 03:18:50PM -0500, Elena Ufimtseva wrote:
>>> On Wed, Jan 22, 2014 at 9:29 AM, Daniel Borkmann <borkmann@iogearbox.net> wrote:
>>>> On 01/22/2014 08:29 AM, Steven Noonan wrote:
>>>>>
>>>>> On Wed, Jan 22, 2014 at 12:02:15AM -0500, Konrad Rzeszutek Wilk wrote:
>>>>>>
>>>>>> On Tue, Jan 21, 2014 at 07:20:45PM -0800, Steven Noonan wrote:
>>>>>>>
>>>>>>> On Tue, Jan 21, 2014 at 06:47:07PM -0800, Linus Torvalds wrote:
>>>>>>>>
>>>>>>>> On Tue, Jan 21, 2014 at 5:49 PM, Greg Kroah-Hartman
>>>>>>>> <gregkh@linuxfoundation.org> wrote:
>>>>>>
>>>>>>
>>>>>> Adding extra folks to the party.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Odds are this also shows up in 3.13, right?
>>>>>>>
>>>>>>>
>>>>>>> Reproduced using 3.13 on the PV guest:
>>>>>>>
>>>>>>> [ 368.756763] BUG: Bad page map in process mp pte:80000004a67c6165 pmd:e9b706067
>>>>>>> [ 368.756777] page:ffffea001299f180 count:0 mapcount:-1 mapping: (null) index:0x0
>>>>>>> [ 368.756781] page flags: 0x2fffff80000014(referenced|dirty)
>>>>>>> [ 368.756786] addr:00007fd1388b7000 vm_flags:00100071 anon_vma:ffff880e9ba15f80 mapping: (null) index:7fd1388b7
>>>>>>> [ 368.756792] CPU: 29 PID: 618 Comm: mp Not tainted 3.13.0-ec2 #1
>>>>>>> [ 368.756795] ffff880e9b718958 ffff880e9eaf3cc0 ffffffff814d8748 00007fd1388b7000
>>>>>>> [ 368.756803] ffff880e9eaf3d08 ffffffff8116d289 0000000000000000 0000000000000000
>>>>>>> [ 368.756809] ffff880e9b7065b8 ffffea001299f180 00007fd1388b8000 ffff880e9eaf3e30
>>>>>>> [ 368.756815] Call Trace:
>>>>>>> [ 368.756825] [<ffffffff814d8748>] dump_stack+0x45/0x56
>>>>>>> [ 368.756833] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>>>>>>> [ 368.756837] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
>>>>>>> [ 368.756842] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>>>>>>> [ 368.756847] [<ffffffff81175dac>] unmap_region+0xac/0x120
>>>>>>> [ 368.756852] [<ffffffff81176379>] ? vma_rb_erase+0x1c9/0x210
>>>>>>> [ 368.756856] [<ffffffff81177f10>] do_munmap+0x280/0x370
>>>>>>> [ 368.756860] [<ffffffff81178041>] vm_munmap+0x41/0x60
>>>>>>> [ 368.756864] [<ffffffff81178f32>] SyS_munmap+0x22/0x30
>>>>>>> [ 368.756869] [<ffffffff814e70ed>] system_call_fastpath+0x1a/0x1f
>>>>>>> [ 368.756872] Disabling lock debugging due to kernel taint
>>>>>>> [ 368.760084] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:0 val:-1
>>>>>>> [ 368.760091] BUG: Bad rss-counter state mm:ffff880e9d079680 idx:1 val:1
>>>>>>>
>>>>>>>>
>>>>>>>> Probably. I don't have a Xen PV setup to test with (and very little
>>>>>>>> interest in setting one up).. And I have a suspicion that it might not
>>>>>>>> be so much about Xen PV, as perhaps about the kind of hardware.
>>>>>>>>
>>>>>>>> I suspect the issue has something to do with the magic _PAGE_NUMA
>>>>>>>> tie-in with _PAGE_PRESENT. And then mprotect(PROT_NONE) ends up
>>>>>>>> removing the _PAGE_PRESENT bit, and now the crazy numa code is
>>>>>>>> confused.
>>>>>>>>
>>>>>>>> The whole _PAGE_NUMA thing is a f*cking horrible hack, and shares the
>>>>>>>> bit with _PAGE_PROTNONE, which is why it then has that tie-in to
>>>>>>>> _PAGE_PRESENT.
>>>>>>>>
>>>>>>>> Adding Andrea to the Cc, because he's the author of that horridness.
>>>>>>>> Putting Steven's test-case here as an attachment for Andrea, maybe
>>>>>>>> that makes him go "Ahh, yes, silly case".
>>>>>>>>
>>>>>>>> Also added Kirill, because he was involved in the last _PAGE_NUMA debacle.
>>>>>>>>
>>>>>>>> Andrea, you can find the thread on lkml, but it boils down to commit
>>>>>>>> 1667918b6483 (backported to 3.12.7 as 3d792d616ba4) breaking the
>>>>>>>> attached test-case (but apparently only under Xen PV). There it
>>>>>>>> apparently causes a "BUG: Bad page map .." error.
>>>>>>
>>>>>>
>>>>>> I *think* it is due to the fact that pmd_numa and pte_numa are getting
>>>>>> the _raw_ value of PMDs and PTEs. That is - it does not use the pvops
>>>>>> interface and instead reads the values directly from the page-table.
>>>>>> Since the page-table is also manipulated by the hypervisor - there are
>>>>>> certain flags it also sets to do its business. It might be that it uses
>>>>>> _PAGE_GLOBAL as well - and Linux picks up on that. If it was using
>>>>>> pte_flags that would invoke the pvops interface.
>>>>>>
>>>>>> Elena, Dariof and George, you guys had been looking at this a bit deeper
>>>>>> than I have. Does the Xen hypervisor use the _PAGE_GLOBAL for PV guests?
>
> It does use _PAGE_GLOBAL for guest user pages
>
>>>>>>
>>>>>> This not-compiled-totally-bad-patch might shed some light on what I was
>>>>>> thinking _could_ fix this issue - and IS NOT A FIX - JUST A HACK.
>>>>>> It does not fix it for PMDs naturally (as there are no PMD paravirt ops
>>>>>> for that).
>>>>>
>>>>>
>>>>> Unfortunately the Totally Bad Patch seems to make no difference. I am
>>>>> still able to repro the issue:
>>>
>>> Steven, do you use numa=fake on the boot command line for the PV guest?
>>>
>>> I had a similar issue on a PV guest. Let me check if the fix that resolved
>>> this for me will help with 3.13.
>>
>> Nope:
>>
>> # cat /proc/cmdline
>> root=/dev/xvda1 ro rootwait rootfstype=ext4 nomodeset console=hvc0 earlyprintk=xen,verbose loglevel=7
>
>>
>>>
>>>>
>>>>
>>>> Maybe this one is also related to this BUG here (cc'ed people investigating
>>>> this one) ...
>>>>
>>>> https://lkml.org/lkml/2014/1/10/427
>>>>
>>>> ... not sure, though.
>>>>
>>>>
>>>>> [ 346.374929] BUG: Bad page map in process mp pte:80000004ae928065 pmd:e993f9067
>>>>> [ 346.374942] page:ffffea0012ba4a00 count:0 mapcount:-1 mapping: (null) index:0x0
>>>>> [ 346.374946] page flags: 0x2fffff80000014(referenced|dirty)
>>>>> [ 346.374951] addr:00007f06a9bbb000 vm_flags:00100071 anon_vma:ffff880e9939fe00 mapping: (null) index:7f06a9bbb
>>>>> [ 346.374956] CPU: 29 PID: 609 Comm: mp Not tainted 3.13.0-ec2+ #1
>>>>> [ 346.374960] ffff880e9cc38da8 ffff880e991a3cc0 ffffffff814d8768 00007f06a9bbb000
>>>>> [ 346.374967] ffff880e991a3d08 ffffffff8116d289 0000000000000000 0000000000000000
>>>>> [ 346.374972] ffff880e993f9dd8 ffffea0012ba4a00 00007f06a9bbc000 ffff880e991a3e30
>>>>> [ 346.374979] Call Trace:
>>>>> [ 346.374988] [<ffffffff814d8768>] dump_stack+0x45/0x56
>>>>> [ 346.374996] [<ffffffff8116d289>] print_bad_pte+0x229/0x250
>>>>> [ 346.375000] [<ffffffff8116eae3>] unmap_single_vma+0x583/0x890
>>>>> [ 346.375006] [<ffffffff8116feb5>] unmap_vmas+0x65/0x90
>>>>> [ 346.375011] [<ffffffff81175dbc>] unmap_region+0xac/0x120
>>>>> [ 346.375016] [<ffffffff81176389>] ? vma_rb_erase+0x1c9/0x210
>>>>> [ 346.375021] [<ffffffff81177f20>] do_munmap+0x280/0x370
>>>>> [ 346.375025] [<ffffffff81178051>] vm_munmap+0x41/0x60
>>>>> [ 346.375029] [<ffffffff81178f42>] SyS_munmap+0x22/0x30
>>>>> [ 346.375034] [<ffffffff814e712d>] system_call_fastpath+0x1a/0x1f
>>>>> [ 346.375037] Disabling lock debugging due to kernel taint
>>>>> [ 346.380082] BUG: Bad rss-counter state mm:ffff880e9d22bc00 idx:0 val:-1
>>>>> [ 346.380088] BUG: Bad rss-counter state mm:ffff880e9d22bc00 idx:1 val:1
>>>>>
>>>>> This dump doesn't look dramatically different, either.
>>>>>
>>>>>>
>>>>>> The other question is - how is AutoNUMA running when it is not enabled?
>>>>>> Shouldn't those _PAGE_NUMA ops be nops when AutoNUMA hasn't even been
>>>>>> turned on?
>>>>>
>>>>>
>>>>> Well, NUMA_BALANCING is enabled in the kernel config[1], but I presume you
>>>>> mean not enabled at runtime?
>>>>>
>>>>> [1]
>>>>> http://git.uplinklabs.net/snoonan/projects/archlinux/ec2/ec2-packages.git/tree/linux-ec2/config.x86_64
>>>
>>>
>>>
>>> --
>>> Elena
>
> I was able to reproduce this consistently, also with the latest mm
> patches from yesterday.
> Can you please try this:
>
> diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
> index ce563be..76dcf96 100644
> --- a/arch/x86/xen/mmu.c
> +++ b/arch/x86/xen/mmu.c
> @@ -365,7 +365,7 @@ void xen_ptep_modify_prot_commit(struct mm_struct *mm, unsigned long addr,
>  /* Assume pteval_t is equivalent to all the other *val_t types. */
>  static pteval_t pte_mfn_to_pfn(pteval_t val)
>  {
> -       if (val & _PAGE_PRESENT) {
> +       if ((val & _PAGE_PRESENT) || ((val & (_PAGE_NUMA|_PAGE_PRESENT)) == _PAGE_NUMA)) {

        if (val & (_PAGE_PRESENT | _PAGE_NUMA))

is equivalent.

David
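
For anyone who would rather check the equivalence than take it on trust, here is a minimal userspace sketch, not kernel code. The DEMO_* names and bit positions are assumptions made for illustration (bit 0 for present, bit 8 for the NUMA hint); the result only depends on the two flags being distinct single bits. check_as_posted() mirrors the condition in the quoted patch and check_simplified() mirrors the shorter form.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for _PAGE_PRESENT and _PAGE_NUMA.
 * The bit positions are assumed for illustration only. */
#define DEMO_PAGE_PRESENT ((uint64_t)1 << 0)
#define DEMO_PAGE_NUMA    ((uint64_t)1 << 8)

/* Condition as written in the quoted patch. */
static int check_as_posted(uint64_t val)
{
    return (val & DEMO_PAGE_PRESENT) ||
           ((val & (DEMO_PAGE_NUMA | DEMO_PAGE_PRESENT)) == DEMO_PAGE_NUMA);
}

/* Shorter form: true when either flag is set. */
static int check_simplified(uint64_t val)
{
    return (val & (DEMO_PAGE_PRESENT | DEMO_PAGE_NUMA)) != 0;
}

/* Walk all four PRESENT/NUMA combinations, each with and without some
 * unrelated bits set, asserting that both forms agree. Returns the
 * number of values compared. */
static int verify_equivalence(void)
{
    /* Arbitrary bits that overlap neither flag (bits 2, 5, 6 and 63). */
    const uint64_t other_bits = 0x8000000000000064ULL;
    int compared = 0;

    for (int present = 0; present <= 1; present++) {
        for (int numa = 0; numa <= 1; numa++) {
            uint64_t val = (present ? DEMO_PAGE_PRESENT : 0) |
                           (numa ? DEMO_PAGE_NUMA : 0);

            assert(check_as_posted(val) == check_simplified(val));
            assert(check_as_posted(val | other_bits) ==
                   check_simplified(val | other_bits));
            compared += 2;
        }
    }
    return compared;
}
```

Calling verify_equivalence() from a small main() exercises every combination. The second clause of the posted condition can only add the case "NUMA set, PRESENT clear", which is exactly what OR-ing the two masks adds.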