Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)

From: Stefan Bader <stefan.bader@canonical.com>
To: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Kees Cook <keescook@chromium.org>,
	"xen-devel@lists.xensource.com" <xen-devel@lists.xensource.com>,
	David Vrabel <david.vrabel@citrix.com>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>
Subject: Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
Date: Wed, 27 Aug 2014 10:03:10 +0200
Message-ID: <53FD90BE.6090709@canonical.com>
In-Reply-To: <20140826160100.GA14835@laptop.dumpdata.com>

On 26.08.2014 18:01, Konrad Rzeszutek Wilk wrote:
> On Fri, Aug 22, 2014 at 11:20:50AM +0200, Stefan Bader wrote:
>> On 21.08.2014 18:03, Kees Cook wrote:
>>> On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk
>>> <konrad.wilk@oracle.com> wrote:
>>>> On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote:
>>>>> On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader
>>>>> <stefan.bader@canonical.com> wrote:
>>>>>> On 12.08.2014 19:28, Kees Cook wrote:
>>>>>>> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader <stefan.bader@canonical.com> wrote:
>>>>>>>> On 08.08.2014 14:43, David Vrabel wrote:
>>>>>>>>> On 08/08/14 12:20, Stefan Bader wrote:
>>>>>>>>>> Unfortunately I have not yet figured out why this happens, but can confirm by
>>>>>>>>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>>>>>>>>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>>>>>>>>>> as a follow up error).
>>>>>>>>>>
>>>>>>>>>> Details can be seen in [1] but basically this is always some portion of a
>>>>>>>>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>>>>>>>>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>>>>>>>>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>>>>>>>>>> the dom0 case there is a more fatal error at some point causing a crash.
>>>>>>>>>>
>>>>>>>>>> I have not tried this for a normal PV guest but for dom0 it also does not help
>>>>>>>>>> to add "nokaslr" to the kernel command-line.
>>>>>>>>>
>>>>>>>>> Maybe it's overlapping with regions of the virtual address space
>>>>>>>>> reserved for Xen?  What is the VA that fails?
>>>>>>>>>
>>>>>>>>> David
>>>>>>>>>
>>>>>>>> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
>>>>>>>> missing p2m tables? I probably need to add debugging to find the failing VA (iow
>>>>>>>> not sure whether it might be somewhere in the stacktraces in the report).
>>>>>>>>
>>>>>>>> The kernel-command line does not seem to be looked at. It should put something
>>>>>>>> into dmesg and that never shows up. Also today's random feature is other PV
>>>>>>>> guests crashing after a bit somewhere in the check_for_corruption area...
>>>>>>>
>>>>>>> Right now, the kaslr code just deals with initrd, cmdline, etc. If
>>>>>>> there are other reserved regions that aren't listed in the e820, it'll
>>>>>>> need to locate and skip them.
>>>>>>>
>>>>>>> -Kees
>>>>>>>
>>>>>> Making my little steps towards more understanding I figured out that it isn't
>>>>>> the code that does the relocation. Even with that completely disabled there were
>>>>>> the vmalloc issues. What causes it seems to be the default of the upper limit
>>>>>> and that this changes the split between kernel and modules to 1G+1G instead of
>>>>>> 512M+1.5G. That is the reason why nokaslr has no effect.
>>>>>
>>>>> Oh! That's very interesting. There must be some assumption in Xen
>>>>> about the kernel VM layout then?
>>>>
>>>> No. I think most of the changes that look at PTEs and PMDs are all
>>>> in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being
>>>> too aggressive.
>>>
>>> (Sorry I had to cut our chat short at Kernel Summit!)
>>>
>>> It sounded like there was another region of memory that Xen was setting
>>> aside for page tables? But Stefan's investigation seems to show this
>>> isn't about layout at boot (since the kaslr=0 case means no relocation
>>> is done). Sounds more like the split between kernel and modules area,
>>> so I'm not sure how the memory area after the initrd would be part of
>>> this. What should next steps be, do you think?
>>
>> Maybe layout, but not about placement of the kernel. Basically leaving KASLR
>> enabled but shrink the possible range back to the original kernel/module split
>> is fine as well.
>>
>> I am bouncing between feeling close to understanding and being confused. Konrad
>> suggested xen_cleanhighmap being overly aggressive. But maybe it's the other way
>> round. The warning that occurs first indicates that a PTE that was obtained for
>> some vmalloc mapping is not unused (0) as expected. So it feels rather
>> like some cleanup has *not* been done.
>>
>> Let me think aloud a bit... What seems to cause this, is the change of the
>> kernel/module split from 512M:1.5G to 1G:1G (not exactly since there is 8M
>> vsyscalls and 2M hole at the end). Which in vaddr terms means:
>>
>> Before:
>> ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
>> ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
>>
>> After:
>> ffffffff80000000 - ffffffffbfffffff (=1024 MB) kernel text mapping, from phys 0
>> ffffffffc0000000 - ffffffffff5fffff (=1014 MB) module mapping space
>>
>> Now, *if* I got this right, this means the kernel starts on a vaddr that is
>> pointed at by:
>>
>> PGD[510]->PUD[510]->PMD[0]->PTE[0]
>>
>> In the old layout the module vaddr area would start in the same PUD area, but
>> with the change the kernel would cover PUD[510] and the module vaddr + vsyscalls
>> and the hole would cover PUD[511].
> 
> I think there is a fixmap there too?

Right, they forgot that in Documentation/x86/x86_64/mm... but head_64.S has it.
So fixmap seems to be in the 2M space before the vsyscalls.
Btw, apparently I got the PGD index wrong. It is of course 511, not 510.

init_level4_pgt[511]->level3_kernel_pgt[510]->level2_kernel_pgt[0..255]->kernel
                                                               [256..511]->mod
                                       [511]->level2_fixmap_pgt[0..505]->mod
                                                               [506]->fixmap
                                                               [507..510]->vsysc
                                                               [511]->hole

With the change, level2_kernel_pgt covers only the kernel.
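
The indices in the diagram above can be cross-checked with a few lines of arithmetic. This is only a sketch, assuming standard x86_64 4-level paging (4 KiB pages, 9 index bits per level); the function name pt_indices is made up for illustration and is not a kernel symbol. The addresses are the ones from the layout listing quoted earlier in this mail.

```python
# Sketch: split an x86_64 virtual address into 4-level page-table indices.
# Assumption: 4 KiB pages and 9 index bits per level (standard 4-level paging).

PAGE_SHIFT = 12
BITS_PER_LEVEL = 9
LEVEL_MASK = (1 << BITS_PER_LEVEL) - 1


def pt_indices(vaddr):
    """Return (pgd, pud, pmd, pte) indices for a 64-bit virtual address."""
    pte = (vaddr >> PAGE_SHIFT) & LEVEL_MASK
    pmd = (vaddr >> (PAGE_SHIFT + BITS_PER_LEVEL)) & LEVEL_MASK
    pud = (vaddr >> (PAGE_SHIFT + 2 * BITS_PER_LEVEL)) & LEVEL_MASK
    pgd = (vaddr >> (PAGE_SHIFT + 3 * BITS_PER_LEVEL)) & LEVEL_MASK
    return pgd, pud, pmd, pte


if __name__ == "__main__":
    # Addresses taken from the "Before"/"After" layout quoted in this thread.
    for name, va in [
        ("kernel text start", 0xffffffff80000000),
        ("modules start (old split)", 0xffffffffa0000000),
        ("modules start (new split)", 0xffffffffc0000000),
        ("vsyscall start", 0xffffffffff600000),
    ]:
        print(name, hex(va), pt_indices(va))
```

With the old split the module area starts at PMD index 256 under the same PUD entry (510) as the kernel; with the new split it starts at PMD index 0 under PUD entry 511.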

>>
>> xen_cleanhighmap operates only on the level2_kernel_pgt which (speculating a bit
>> since I am not sure I understand enough details) I believe is the one PMD
>> pointed at by PGD[510]->PUD[510]. That could mean that before the change
> 
> That sounds right.
> 
> I don't know if you saw:
> 
> #ifdef DEBUG
>         /* This is superfluous and is not necessary, but you know what
>          * lets do it. The MODULES_VADDR -> MODULES_END should be clear of
>          * anything at this stage. */
>         xen_cleanhighmap(MODULES_VADDR, roundup(MODULES_VADDR, PUD_SIZE) - 1);
> #endif
> }

I saw that, but it would have no effect even when it runs, because
xen_cleanhighmap clamps the PMDs it walks over to the level2_kernel_pgt page,
and MODULES_VADDR is now mapped only from level2_fixmap_pgt.
Even with the old layout it might do less than anticipated, as it would only
cover 512M and then stop. But I think it really does not matter.
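
To make the range argument concrete, here is a small arithmetic model. It is not the kernel's code: it only evaluates the bounds passed by the DEBUG call quoted above and intersects them with the 1 GiB that a single PMD page can map. The two MODULES_VADDR values come from the old and new layouts listed earlier; the constant KERNEL_PMD_PAGE_BASE and the helper names are assumptions made for this sketch.

```python
# Arithmetic model only, not kernel code.
# Assumption: one PMD page (level2_kernel_pgt) maps exactly the 1 GiB
# starting at the kernel text mapping, 0xffffffff80000000.

PUD_SIZE = 1 << 30  # 1 GiB, the span covered by one full PMD page

OLD_MODULES_VADDR = 0xffffffffa0000000  # 512M:1.5G split
NEW_MODULES_VADDR = 0xffffffffc0000000  # 1G:1G split
KERNEL_PMD_PAGE_BASE = 0xffffffff80000000  # hypothetical name


def roundup(x, align):
    """Round x up to the next multiple of align (x itself if already aligned)."""
    return (x + align - 1) // align * align


def debug_range(modules_vaddr):
    """Inclusive [start, end] as computed by the quoted DEBUG call."""
    return modules_vaddr, roundup(modules_vaddr, PUD_SIZE) - 1


def bytes_within_kernel_pmd_page(start, end):
    """Bytes of [start, end] that fall inside the modelled PMD page's 1 GiB."""
    lo = max(start, KERNEL_PMD_PAGE_BASE)
    hi = min(end, KERNEL_PMD_PAGE_BASE + PUD_SIZE - 1)
    return max(0, hi - lo + 1)


if __name__ == "__main__":
    for label, base in [("old", OLD_MODULES_VADDR), ("new", NEW_MODULES_VADDR)]:
        start, end = debug_range(base)
        covered = bytes_within_kernel_pmd_page(start, end)
        print(label, hex(start), hex(end), covered // (1 << 20), "MiB")
```

Under these assumptions the old layout yields a 512 MiB range inside the kernel PMD page. With the new layout MODULES_VADDR is already 1 GiB aligned, so the roundup returns it unchanged and the end of the range lies below its start.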
> 
> Which was me being a bit paranoid and figured it might help in troubleshooting.
> If you disable that does it work?
> 
>> xen_cleanhighmap may touch some (the initial 512M) of the module vaddr space but
>> not after the change. Maybe that also means it always should have covered more
>> but this would not be observed as long as modules would not claim more than
>> 512M? I still need to check the vaddr ranges for which xen_cleanhighmap is
>> actually called. The modules vaddr space would normally not be touched (only
>> with DEBUG set). I moved that to be unconditionally done but then this might be
>> of no use when it needs to cover a different PMD...
> 
> What does the toolstack say in regards to allocating the memory? It is pretty
> verbose (domainloginfo..something) in printing out the vaddr of where
> it stashes the kernel, ramdisk, P2M, and the pagetables (which of course
> need to fit all within the 512MB, now 1GB area).

That is taken from starting a 2G PV domU with pvgrub (not pygrub):

Xen Minimal OS!
  start_info: 0xd90000(VA)
    nr_pages: 0x80000
  shared_inf: 0xdfe92000(MA)
     pt_base: 0xd93000(VA)
nr_pt_frames: 0xb
    mfn_list: 0x990000(VA)
   mod_start: 0x0(VA)
     mod_len: 0
       flags: 0x0
    cmd_line:
  stack:      0x94f860-0x96f860
MM: Init
      _text: 0x0(VA)
     _etext: 0x6000d(VA)
   _erodata: 0x78000(VA)
     _edata: 0x80b00(VA)
stack start: 0x94f860(VA)
       _end: 0x98fe68(VA)
  start_pfn: da1
    max_pfn: 80000
Mapping memory range 0x1000000 - 0x80000000
setting 0x0-0x78000 readonly


For a moment I was puzzled by the use of max_pfn_mapped in the generic
cleanup_highmap function of 64bit x86. It limits the cleanup to the start of the
mfn_list. And the max_pfn_mapped value changes soon after to reflect the total
amount of memory of the guest.
Making a copy showed it to be around 51M at the time of cleanup. That initially
looks suspect but Xen already replaced the page tables. The compile-time
variants would have 2M large pages on the whole level2_kernel_pgt range. But as
far as I can see, the Xen provided ones don't put in mappings for anything
beyond the provided boot stack which is clean in the xen_cleanhighmap.

So not much further... but I think I know what to do next, and probably should
have done before: replace the WARN_ON in vmalloc that triggers with a panic, so
that I at least get a crash dump of that situation when it occurs. Then I can dig
in there with crash (really should have thought of that before)...

-Stefan
> 
>>
>> Really not sure here. But maybe a starter for others...
>>
>> -Stefan
>>
>>>
>>> -Kees
>>>
>>>
>>>>>
>>>>> -Kees
>>>>>
>>>>> --
>>>>> Kees Cook
>>>>> Chrome OS Security
>>>>>
>>>>> _______________________________________________
>>>>> Xen-devel mailing list
>>>>> Xen-devel@lists.xen.org
>>>>> http://lists.xen.org/xen-devel
>>>
>>>
>>>
>>
>>
> 
> 



Thread overview: 28+ messages
2014-08-08 11:20 Xen PV domain regression with KASLR enabled (kernel 3.16) Stefan Bader
2014-08-08 12:43 ` [Xen-devel] " David Vrabel
2014-08-08 14:35   ` Stefan Bader
2014-08-12 17:28     ` Kees Cook
2014-08-12 18:05       ` Stefan Bader
2014-08-12 18:53         ` Kees Cook
2014-08-12 19:07           ` Konrad Rzeszutek Wilk
2014-08-21 16:03             ` Kees Cook
2014-08-22  9:20               ` Stefan Bader
2014-08-26 16:01                 ` Konrad Rzeszutek Wilk
2014-08-27  8:03                   ` Stefan Bader [this message]
2014-08-27 20:49                     ` Konrad Rzeszutek Wilk
2014-08-28 18:01                       ` [PATCH] Solved the Xen PV/KASLR riddle Stefan Bader
2014-08-28 22:22                         ` Kees Cook
2014-08-28 22:42                         ` Andrew Cooper
2014-08-29  8:37                           ` [Xen-devel] " Stefan Bader
2014-08-29 14:19                             ` Andrew Cooper
2014-08-29 14:32                               ` Stefan Bader
2014-08-29 14:43                                 ` Andrew Cooper
2014-08-29 14:08                         ` Konrad Rzeszutek Wilk
2014-08-29 14:27                           ` Stefan Bader
2014-08-29 14:31                             ` David Vrabel
2014-08-29 14:35                               ` Stefan Bader
2014-08-29 14:44                             ` [Xen-devel] " Jan Beulich
2014-08-29 14:55                               ` Konrad Rzeszutek Wilk
2014-09-01  4:03                                 ` Juergen Gross
2014-09-02 19:22                                   ` Konrad Rzeszutek Wilk
2014-09-03  4:07                                     ` Juergen Gross
