From: Stefan Bader
Subject: Re: [Xen-devel] Xen PV domain regression with KASLR enabled (kernel 3.16)
Date: Fri, 22 Aug 2014 11:20:50 +0200
Message-ID: <53F70B72.7030407@canonical.com>
References: <53E4B281.5050302@canonical.com> <53E4C5D5.2090103@citrix.com>
 <53E4E042.1070300@canonical.com> <53EA5782.1080301@canonical.com>
 <20140812190726.GC13996@laptop.dumpdata.com>
Sender: linux-kernel-owner@vger.kernel.org
To: Kees Cook, Konrad Rzeszutek Wilk
Cc: "xen-devel@lists.xensource.com", David Vrabel, Linux Kernel Mailing List
List-Id: xen-devel@lists.xenproject.org

On 21.08.2014 18:03, Kees Cook wrote:
> On Tue, Aug 12, 2014 at 2:07 PM, Konrad Rzeszutek Wilk wrote:
>> On Tue, Aug 12, 2014 at 11:53:03AM -0700, Kees Cook wrote:
>>> On Tue, Aug 12, 2014 at 11:05 AM, Stefan Bader wrote:
>>>> On 12.08.2014 19:28, Kees Cook wrote:
>>>>> On Fri, Aug 8, 2014 at 7:35 AM, Stefan Bader wrote:
>>>>>> On 08.08.2014 14:43, David Vrabel wrote:
>>>>>>> On 08/08/14 12:20, Stefan Bader wrote:
>>>>>>>> Unfortunately I have not yet figured out why this happens, but can confirm by
>>>>>>>> compiling with or without CONFIG_RANDOMIZE_BASE being set that without KASLR all
>>>>>>>> is ok, but with it enabled there are issues (actually a dom0 does not even boot
>>>>>>>> as a follow up error).
>>>>>>>>
>>>>>>>> Details can be seen in [1] but basically this is always some portion of a
>>>>>>>> vmalloc allocation failing after hitting a freshly allocated PTE space not being
>>>>>>>> PTE_NONE (usually from a module load triggered by systemd-udevd). In the
>>>>>>>> non-dom0 case this repeats many times but ends in a guest that allows login. In
>>>>>>>> the dom0 case there is a more fatal error at some point causing a crash.
>>>>>>>>
>>>>>>>> I have not tried this for a normal PV guest but for dom0 it also does not help
>>>>>>>> to add "nokaslr" to the kernel command-line.
>>>>>>>
>>>>>>> Maybe it's overlapping with regions of the virtual address space
>>>>>>> reserved for Xen? What is the VA that fails?
>>>>>>>
>>>>>>> David
>>>>>>>
>>>>>> Yeah, there is some code to avoid some regions of memory (like initrd). Maybe
>>>>>> missing p2m tables? I probably need to add debugging to find the failing VA (iow
>>>>>> not sure whether it might be somewhere in the stacktraces in the report).
>>>>>>
>>>>>> The kernel-command line does not seem to be looked at. It should put something
>>>>>> into dmesg and that never shows up. Also today's random feature is other PV
>>>>>> guests crashing after a bit somewhere in the check_for_corruption area...
>>>>>
>>>>> Right now, the kaslr code just deals with initrd, cmdline, etc. If
>>>>> there are other reserved regions that aren't listed in the e820, it'll
>>>>> need to locate and skip them.
>>>>>
>>>>> -Kees
>>>>>
>>>> Making my little steps towards more understanding I figured out that it isn't
>>>> the code that does the relocation. Even with that completely disabled there were
>>>> the vmalloc issues. What causes it seems to be the default of the upper limit
>>>> and that this changes the split between kernel and modules to 1G+1G instead of
>>>> 512M+1.5G. That is the reason why nokaslr has no effect.
>>>
>>> Oh! That's very interesting.
>>> There must be some assumption in Xen
>>> about the kernel VM layout then?
>>
>> No. I think most of the changes that look at PTE and PMDs are all
>> in arch/x86/xen/mmu.c. I wonder if this is xen_cleanhighmap being
>> too aggressive
>
> (Sorry I had to cut our chat short at Kernel Summit!)
>
> It sounded like there was another region of memory that Xen was setting
> aside for page tables? But Stefan's investigation seems to show this
> isn't about layout at boot (since the kaslr=0 case means no relocation
> is done). Sounds more like the split between kernel and modules area,
> so I'm not sure how the memory area after the initrd would be part of
> this. What should next steps be, do you think?

Maybe layout, but not about placement of the kernel. Basically, leaving KASLR
enabled but shrinking the possible range back to the original kernel/module
split is fine as well.

I am bouncing between feeling close to understanding and being confused. Konrad
suggested xen_cleanhighmap being overly aggressive. But maybe it's the other way
round. The warning that occurs first indicates that a PTE that was obtained for
some vmalloc mapping is not unused (0) as expected. So it feels rather
like some cleanup has *not* been done.

Let me think aloud a bit... What seems to cause this is the change of the
kernel/module split from 512M:1.5G to 1G:1G (not exactly, since there are 8M of
vsyscalls and a 2M hole at the end).
Which in vaddr terms means:

Before:
ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space

After:
ffffffff80000000 - ffffffffbfffffff (=1024 MB) kernel text mapping, from phys 0
ffffffffc0000000 - ffffffffff5fffff (=1014 MB) module mapping space

Now, *if* I got this right, this means the kernel starts on a vaddr that is
pointed at by:

PGD[511]->PUD[510]->PMD[0]->PTE[0]

In the old layout the module vaddr area would start in the same PUD area, but
with the change the kernel would cover PUD[510] and the module vaddr + vsyscalls
and the hole would cover PUD[511].

xen_cleanhighmap operates only on level2_kernel_pgt, which (speculating a bit
since I am not sure I understand enough details) I believe is the one PMD
pointed at by PGD[511]->PUD[510]. That could mean that before the change
xen_cleanhighmap may touch some (the initial 512M) of the module vaddr space but
not after the change. Maybe that also means it always should have covered more,
but this would not be observed as long as modules would not claim more than
512M? I still need to check the vaddr ranges for which xen_cleanhighmap is
actually called. The modules vaddr space would normally not be touched (only
with DEBUG set). I moved that to be done unconditionally, but then this might
be of no use when it needs to cover a different PMD...

Really not sure here. But maybe a starter for others...
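
For anyone who wants to re-check the index and size arithmetic above, here is a
minimal sketch. It assumes plain x86-64 4-level paging with 4 KiB pages (9 index
bits per level, shifts 39/30/21/12); pt_indices() and size_mb() are names made
up for this mail, not kernel functions. It prints the sizes of the kernel and
module areas for both splits and the PGD/PUD/PMD/PTE slots of the relevant start
addresses, so the slots named above can be compared against it.

```python
# Sketch only: index arithmetic for x86-64 4-level paging with 4 KiB pages.
# pt_indices() and size_mb() are illustrative helpers, not kernel functions.

PAGE_SHIFT = 12     # one PTE maps 4 KiB
PMD_SHIFT = 21      # one PMD entry maps 2 MiB
PUD_SHIFT = 30      # one PUD entry maps 1 GiB
PGD_SHIFT = 39      # one PGD entry maps 512 GiB
INDEX_MASK = 0x1FF  # 9 index bits per table level

KERNEL_START = 0xFFFFFFFF80000000   # start of the kernel text mapping
MODULES_END = 0xFFFFFFFFFF5FFFFF    # last byte of the module mapping space
MODULES_START = {
    "before": 0xFFFFFFFFA0000000,   # 512M kernel area
    "after": 0xFFFFFFFFC0000000,    # 1G kernel area
}


def pt_indices(vaddr):
    """Return the (pgd, pud, pmd, pte) table indices of a virtual address."""
    return tuple((vaddr >> shift) & INDEX_MASK
                 for shift in (PGD_SHIFT, PUD_SHIFT, PMD_SHIFT, PAGE_SHIFT))


def size_mb(start, end):
    """Size in MB (2**20 bytes) of the inclusive range [start, end]."""
    return (end - start + 1) >> 20


def describe(name):
    """One summary line for the 'before' or 'after' layout."""
    mod_start = MODULES_START[name]
    return "%s: kernel %d MB, modules %d MB, modules start at %s" % (
        name,
        size_mb(KERNEL_START, mod_start - 1),
        size_mb(mod_start, MODULES_END),
        pt_indices(mod_start),
    )


if __name__ == "__main__":
    print("kernel start at (pgd, pud, pmd, pte) =", pt_indices(KERNEL_START))
    for layout in ("before", "after"):
        print(describe(layout))
```

With these inputs the module area starts at PUD[510]/PMD[256] in the old layout
and at PUD[511]/PMD[0] in the new one, which matches the observation that the
1G:1G split moves the whole module space out of the PMD page that covers the
kernel mapping.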

-Stefan

>
> -Kees
>
>>>
>>> -Kees
>>>
>>> --
>>> Kees Cook
>>> Chrome OS Security