From: Konrad Rzeszutek Wilk
Subject: L1[0x1fb] = 0000000000000000 which faults on one type of machine but on another works?
Date: Wed, 16 Mar 2011 18:19:12 -0400
Message-ID: <20110316221912.GA13035@dumpdata.com>
To: xen-devel@lists.xensource.com, gianni.tedesco@citrix.com, andrew.thomas@oracle.com, Jeremy Fitzhardinge, Ian Campbell, keir.xen@gmail.com
Cc: swente@infinitumb.de

I am troubleshooting an issue where the Linux kernel tries to dereference
a not-present entry. I have a fix for this in for-2.6.32/bug-fixes, but
please read on.

Specifically, it tries to dereference the fixmapped value of APIC_BASE.
The fixmapped value of APIC_BASE is actually not set due to git commit
a1d8e2fa8325064338b2da1bcf0d7a0473883c284, which adds this in
arch/x86/kernel/acpi/boot.c:

    static void __init acpi_register_lapic_address(unsigned long address)
    {
            /* Xen dom0 doesn't have usable lapics */
            if (xen_initial_domain())
                    return;

            mp_lapic_addr = address;

            set_fixmap_nocache(FIX_APIC_BASE, address);

Later on we use 'native_apic_read', which tries to use APIC_BASE as the
address (it is presumed to be at slot FIX_APIC_BASE of the fixmap API),
and it fails (on some machines).

Since we don't call 'set_fixmap_nocache(FIX_APIC_BASE)', if one were to
walk the pagetable this is what we get:

[    0.000000] SMP: Allowing 1 CPUs, 0 hotplug CPUs
[    0.000000] mapped APIC to ffffffffff5fb000 (00000000)
(XEN) d0:v0: unhandled page fault (ec=0000)
(XEN) Pagetable walk from ffffffffff5fb020:
(XEN)  L4[0x1ff] = 0000000221003067 0000000000001003
(XEN)  L3[0x1ff] = 0000000221004067 0000000000001004
(XEN)  L2[0x1fa] = 0000000221771067 0000000000001771
(XEN)  L1[0x1fb] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.1-110309  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e033:[]
(XEN) RFLAGS: 0000000000000292   EM: 1   CONTEXT: pv guest
(XEN) rax: ffffffff8164cf50   rbx: 000000026ec00000   rcx: 00000000ffffdd85
(XEN) rdx: 00000000ffffffff   rsi: 0000000000000000   rdi: 0000000000000020
(XEN) rbp: ffffffff81643ea8   rsp: ffffffff81643e50   r8:  0000000000000002
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: ffff880013671800   r13: 00000000bff66000   r14: ffffffffffffffff
(XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 0000000221001000   cr2: ffffffffff5fb020
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffffff81643e50:

Which is to say that the L1 has this:

0000000115771fa0:  00000000 00000000 00000000 00000000
0000000115771fb0:  00000000 00000000 00000000 00000000
0000000115771fc0:  00000000 00000000 15770067 80100001
0000000115771fd0:  15770067 80100001 00000000 00000000
0000000115771fe0:  00000000 00000000 00000000 00000000
0000000115771ff0:  00000000 00000000 00000000 00000000

L1[0x1fb] is machine address 115771fd8, which has nothing in it.
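
For anyone who wants to check the indices in the walk above by hand, here is a minimal sketch in C of how a 4-level x86_64 walk splits the faulting address (cr2 = ffffffffff5fb020) into table indices, and how an entry in the dump can be tested for the present bit. The helper names (pt_index, pt_entry_offset, pte_from_words, pte_present) are hypothetical and made up for illustration; they are not kernel or Xen functions. It also assumes the dump prints little-endian 32-bit words with the low half of each 64-bit entry first.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT     12              /* 4 KiB pages */
#define PT_INDEX_BITS  9               /* 512 entries per table */
#define PT_INDEX_MASK  ((1u << PT_INDEX_BITS) - 1)
#define PTE_PRESENT    0x1ULL          /* bit 0 of an x86 page table entry */

/* Hypothetical helper: index into the table at a level (1 = L1 ... 4 = L4). */
static inline unsigned int pt_index(uint64_t va, int level)
{
	int shift = PAGE_SHIFT + PT_INDEX_BITS * (level - 1);

	return (unsigned int)((va >> shift) & PT_INDEX_MASK);
}

/* Hypothetical helper: byte offset of an entry in its table (8 bytes each). */
static inline unsigned int pt_entry_offset(unsigned int index)
{
	return index * 8;
}

/* Assumption: the dump shows the low 32-bit word first, then the high word. */
static inline uint64_t pte_from_words(uint32_t lo, uint32_t hi)
{
	return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: non-zero if the entry has the present bit set. */
static inline int pte_present(uint64_t pte)
{
	return (pte & PTE_PRESENT) != 0;
}

/* Print the index used at each level, in the same order Xen reports them. */
static inline void print_walk(uint64_t va)
{
	for (int level = 4; level >= 1; level--)
		printf("L%d[0x%x]\n", level, pt_index(va, level));
}
```

Calling print_walk(0xffffffffff5fb020ULL) should list the same four indices that Xen reports. Entry 0x1fb sits at byte offset 0xfd8 within its L1 page, which matches the low 12 bits of the machine address quoted above. The low 12 bits of cr2 (0x20) are the offset within the APIC page that was being read, which is also the value in rdi.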

OK, so I've come up with a fix that is a back-port of how 2.6.38 does it:
it removes the check I mentioned above, and in xen_set_fixmap we set
FIX_APIC_BASE to actually point to a dummy ioapic_mapping.

It is 7cb068cf1ba90425e12f3a7b3caed9d018fa9b8c in for-2.6.32/bug-fixes.

Gianni, you might want to check this out in case it fixes the problem
you are experiencing.

But one thing I can't understand is why on one machine (IBM x3850) I get
this crash, while on another one with the same pagetable contents (L1 has
nothing for 0x1fb) it works just fine. I added a panic and used the Xen
hypervisor kdb to manually inspect the pagetable, and it has the same
contents as the IBM x3850, but it boots fine with this invalid value.

Any ideas?

FYI, it seems another user (Sven Sübert) with an IBM x3650 hits the same
bug. With this fix he is able to boot.