From mboxrd@z Thu Jan  1 00:00:00 1970
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Subject: L1[0x1fb] = 0000000000000000 which faults on one type
 of machine but on another works?
Date: Wed, 16 Mar 2011 18:19:12 -0400
Message-ID: <20110316221912.GA13035@dumpdata.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Transfer-Encoding: quoted-printable
Return-path: <xen-devel-bounces@lists.xensource.com>
Content-Disposition: inline
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: xen-devel@lists.xensource.com, gianni.tedesco@citrix.com, andrew.thomas@oracle.com, Jeremy Fitzhardinge <jeremy@goop.org>, Ian Campbell <Ian.Campbell@eu.citrix.com>, keir.xen@gmail.com
Cc: swente@infinitumb.de
List-Id: xen-devel@lists.xenproject.org

I am troubleshooting an issue where the Linux kernel tries
to dereference a not-present entry. I have a fix for this
in for-2.6.32/bug-fixes, but please read on.

Specifically, it tries to dereference the fixmapped value of
APIC_BASE. The fixmapped value of APIC_BASE is actually not set
due to git commit a1d8e2fa8325064338b2da1bcf0d7a0473883c284
which adds this in arch/x86/kernel/acpi/boot.c:

static void __init acpi_register_lapic_address(unsigned long address)
 {
        /* Xen dom0 doesn't have usable lapics */
       if (xen_initial_domain())
             return;

        mp_lapic_addr = address;

	set_fixmap_nocache(FIX_APIC_BASE, address);

Later on we use 'native_apic_read', which tries to use APIC_BASE as the
address (it is presumed to be at slot FIX_APIC_BASE of the fixmap
area), and it fails (on some machines).

Since we don't call 'set_fixmap_nocache(FIX_APIC_BASE)', this is
what we get if we go through the pagetable:


[    0.000000] SMP: Allowing 1 CPUs, 0 hotplug CPUs
[    0.000000] mapped APIC to ffffffffff5fb000 (00000000)
(XEN) d0:v0: unhandled page fault (ec=0000)
(XEN) Pagetable walk from ffffffffff5fb020:
(XEN)  L4[0x1ff] = 0000000221003067 0000000000001003
(XEN)  L3[0x1ff] = 0000000221004067 0000000000001004
(XEN)  L2[0x1fa] = 0000000221771067 0000000000001771
(XEN)  L1[0x1fb] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 0 (vcpu#0) crashed on cpu#0:
(XEN) ----[ Xen-4.1-110309  x86_64  debug=y  Tainted:    C ]----
(XEN) CPU:    0
(XEN) RIP:    e033:[<ffffffff8102b5d1>]
(XEN) RFLAGS: 0000000000000292   EM: 1   CONTEXT: pv guest
(XEN) rax: ffffffff8164cf50   rbx: 000000026ec00000   rcx: 00000000ffffdd85
(XEN) rdx: 00000000ffffffff   rsi: 0000000000000000   rdi: 0000000000000020
(XEN) rbp: ffffffff81643ea8   rsp: ffffffff81643e50   r8:  0000000000000002
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: ffff880013671800   r13: 00000000bff66000   r14: ffffffffffffffff
(XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 0000000221001000   cr2: ffffffffff5fb020
(XEN) ds: 0000   es: 0000   fs: 0000   gs: 0000   ss: e02b   cs: e033
(XEN) Guest stack trace from rsp=ffffffff81643e50:

Which is to say that the L1 has this:
0000000115771fa0:  00000000 00000000 00000000 00000000
0000000115771fb0:  00000000 00000000 00000000 00000000
0000000115771fc0:  00000000 00000000 15770067 80100001
0000000115771fd0:  15770067 80100001 00000000 00000000
0000000115771fe0:  00000000 00000000 00000000 00000000
0000000115771ff0:  00000000 00000000 00000000 00000000

L1[0x1fb] is machine address 115771fd8, which has nothing in it.
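
For anyone who wants to double-check the indices in the walk above, here
is a minimal sketch of the arithmetic, assuming standard x86-64 4-level
paging (9 index bits per level, 12-bit page offset, 8-byte entries).
The helper names pt_index, pt_entry_offset and pt_entry_present are made
up for illustration; they are not kernel or Xen functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: x86-64 4-level paging, 4 KiB pages. */
#define PT_PAGE_SHIFT   12      /* low 12 bits are the page offset */
#define PT_LEVEL_BITS   9       /* 512 entries per table */
#define PT_INDEX_MASK   0x1ffu
#define PT_ENTRY_SIZE   8u      /* each entry is a 64-bit word */
#define PT_PRESENT      0x1ull  /* bit 0 of an entry: present */

/* Index into the level-N table (1 = L1 ... 4 = L4) for a virtual address. */
static unsigned pt_index(uint64_t va, int level)
{
    assert(level >= 1 && level <= 4);
    return (unsigned)(va >> (PT_PAGE_SHIFT + PT_LEVEL_BITS * (level - 1)))
           & PT_INDEX_MASK;
}

/* Byte offset of entry 'index' within its 4 KiB table page. */
static unsigned pt_entry_offset(unsigned index)
{
    return index * PT_ENTRY_SIZE;
}

/* Nonzero if the entry has its present bit set. */
static int pt_entry_present(uint64_t entry)
{
    return (entry & PT_PRESENT) != 0;
}
```

For the faulting address ffffffffff5fb020 this gives 0x1ff, 0x1ff, 0x1fa
and 0x1fb for L4 down to L1, matching the walk Xen printed.
pt_entry_offset(0x1fb) is 0xfd8, which is where the 115771fd8 above comes
from, and an all-zero entry has the present bit clear.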

OK, so I've come up with a fix that is a back-port of how 2.6.38 does it:
it removes the check I mentioned above, and in xen_set_fixmap
we set FIX_APIC_BASE to actually point to a dummy ioapic_mapping.
It is 7cb068cf1ba90425e12f3a7b3caed9d018fa9b8c in for-2.6.32/bug-fixes.

Gianni, you might want to check this out in case it fixes the problem you
are experiencing.

But one thing I can't understand is why I get this crash on one machine
(IBM x3850), while on another one with the same pagetable contents
(L1 has nothing for 0x1fb) it works just fine. I added a panic and used
the Xen hypervisor kdb to manually inspect the pagetable, and it has
the same contents as on the IBM x3850, but it boots fine with this invalid
value.
Any ideas?


FYI, it seems another user (Sven Sübert) with an IBM x3650 hits the same
bug. And with this fix he is able to boot.