From mboxrd@z Thu Jan  1 00:00:00 1970
From: Jeremy Fitzhardinge <jeremy@goop.org>
Subject: Re: current xen/stable 2.6.32.9 failed upgrade
	from	2.6.31.6
Date: Tue, 23 Mar 2010 22:10:48 -0700
Message-ID: <4BA99ED8.8030209@goop.org>
References: <20100306115833.GA28039@orion.carnet.hr>	<20100306132711.GK2580@reaktio.net>	<20100307233147.GA20068@orion.carnet.hr>	<20100311150823.GA9011@orion.carnet.hr>	<20100311192456.GY1878@reaktio.net>	<20100312114139.GA4067@orion.carnet.hr>	<20100312120914.GA15561@orion.carnet.hr>	<20100323231853.GA21109@orion.carnet.hr>
	<20100323232223.GA22681@orion.carnet.hr>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Return-path: <xen-devel-bounces@lists.xensource.com>
In-Reply-To: <20100323232223.GA22681@orion.carnet.hr>
List-Unsubscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=unsubscribe>
List-Post: <mailto:xen-devel@lists.xensource.com>
List-Help: <mailto:xen-devel-request@lists.xensource.com?subject=help>
List-Subscribe: <http://lists.xensource.com/mailman/listinfo/xen-devel>,
	<mailto:xen-devel-request@lists.xensource.com?subject=subscribe>
Sender: xen-devel-bounces@lists.xensource.com
Errors-To: xen-devel-bounces@lists.xensource.com
To: Josip Rodin <joy@entuzijast.net>
Cc: Xen-devel <xen-devel@lists.xensource.com>
List-Id: xen-devel@lists.xenproject.org

On 03/23/2010 04:22 PM, Josip Rodin wrote:
> On Wed, Mar 24, 2010 at 12:18:53AM +0100, Josip Rodin wrote:
>    
>> On Fri, Mar 12, 2010 at 01:09:14PM +0100, Josip Rodin wrote:
>>      
>>> On Fri, Mar 12, 2010 at 12:41:39PM +0100, Josip Rodin wrote:
>>>        
>>>> And now here goes the whole output preceding the 2.6.32 crash:
>>>>          
>>> [...]
>>>        
>>>> In the meantime there was another update to the stable branch, I'll go
>>>> compile that...
>>>>          
>>> The symptoms remained the same, only the CPU MHz calculation and some memory
>>> offsets are different.
>>>
>>> (XEN) mm.c:720:d0 Bad L1 flags 800000
>>> (XEN) mm.c:4221:d0 ptwr_emulate: could not get_page_from_l1e()
>>> (XEN) d0:v0: unhandled page fault (ec=0003)
>>> (XEN) Pagetable walk from ffff8800014fdfd8:
>>> (XEN)  L4[0x110] = 0000000115002067 0000000000001002
>>> (XEN)  L3[0x000] = 0000000115006067 0000000000001006
>>> (XEN)  L2[0x00a] = 0000000116c8a067 0000000000002c8a
>>> (XEN)  L1[0x0fd] = 00100001154fd065 00000000000014fd
>>> (XEN) domain_crash_sync called from entry.S
>>> (XEN) Domain 0 (vcpu#0) crashed on cpu#0:
>>> (XEN) ----[ Xen-3.4  x86_64  debug=n  Not tainted ]----
>>>        
>> FWIW I tried to git bisect this in the last couple of days, but the result
>> turned out to be fairly obvious and useless as after 14 bisections
>> I only came to this:
>>
>> commit 18ecfad3aaeead019b0e07078f643deaa7d10d44
>>      x86: make /dev/mem mappings _PAGE_IOMAP
>> commit 56f27a6d47275f6dc94adf3ecc5fe958cdcdebee
>>      xen/dom0: add XEN_DOM0 config option
>>
>> I didn't follow through with the last bisection, it had seemed increasingly
>> futile for a while now... :)
>>
>> I saw a peculiar side effect at one point, when I went back to a random
>> working 2.6.31.1 dom0, all userland processes started crashing with Illegal
>> instruction. One iLO reset later, it's all good again. I'm guessing it was
>> a transient broken state.
>>
>> And then when I gave up and updated to latest xen/stable for one last try,
>> that was the biggest d'oh moment - it's fixed :) Was it de67ec8b?
>>      

Yes, de67ec8b was the fix.

> BTW with the working .32 kernel, the log says:
>
> [    0.000000] ACPI: IOAPIC (id[0x08] address[0xfec00000] gsi_base[0])
> [    0.000000] IOAPIC[0]: apic_id 8, version 0, address 0xfec00000, GSI 0-0
> [    0.000000] ACPI: IOAPIC (id[0x09] address[0xfec80000] gsi_base[24])
> [    0.000000] IOAPIC[1]: apic_id 9, version 0, address 0xfec80000, GSI 24-24
> [    0.000000] ACPI: IOAPIC (id[0x0a] address[0xfec80400] gsi_base[48])
> [    0.000000] IOAPIC[2]: apic_id 10, version 0, address 0xfec80400, GSI 48-48
> [    0.000000] ACPI: IOAPIC (id[0x0b] address[0xfec84000] gsi_base[72])
> [    0.000000] IOAPIC[3]: apic_id 11, version 0, address 0xfec84000, GSI 72-72
> [    0.000000] ACPI: IOAPIC (id[0x0c] address[0xfec84400] gsi_base[96])
> [    0.000000] IOAPIC[4]: apic_id 12, version 0, address 0xfec84400, GSI 96-96
> [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 high edge)
> [    0.000000] ERROR: Unable to locate IOAPIC for GSI 2
> [    0.000000] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
> [    0.000000] ERROR: Unable to locate IOAPIC for GSI 9
>
> [...]
>
> [    0.023694] ACPI: bus type pci registered
> [    0.023915] PCI: Found Intel Corporation E7520 Memory Controller Hub with MMCONFIG support.
> [    0.023935] PCI: MCFG configuration 0: base e0000000 segment 0 buses 0 - 255
> [    0.023942] PCI: Not using MMCONFIG.
> [    0.023948] PCI: Using configuration type 1 for base access
> [    0.023959] PCI: HP ProLiant DL380 detected, enabling pci=bfsort.
> [    0.028634] bio: create slab<bio-0>  at 0
> [    0.030115] ERROR: Unable to locate IOAPIC for GSI 9
>
> Is there anything I can do to avoid these?
>    

These are just noise.  The kernel thinks it can poke at the IO APICs, 
but they're owned by Xen and so don't exist for the kernel; instead, 
alternate mechanisms come into play to keep the interrupts flowing.  At 
some point I hope we can completely remove all trace of the APICs from 
the kernel's sight, so it won't even try to access them and print these 
confusing messages.
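
If the messages clutter a log you need to read in the meantime, here is 
a minimal sketch of filtering them out after the fact.  It only 
post-processes saved log text and does not change what the kernel 
prints.  The function name drop_ioapic_noise is made up for 
illustration, and it assumes the message wording stays exactly as 
quoted above.

```python
import re

# Assumption: the harmless message keeps the exact wording quoted above.
NOISE = re.compile(r"ERROR: Unable to locate IOAPIC for GSI \d+")


def drop_ioapic_noise(lines):
    """Return the log lines that do not match the IOAPIC noise pattern."""
    return [line for line in lines if not NOISE.search(line)]


# Small demonstration using two lines taken from the log quoted above.
sample = [
    "[    0.000000] ERROR: Unable to locate IOAPIC for GSI 2",
    "[    0.023694] ACPI: bus type pci registered",
]
for line in drop_ioapic_noise(sample):
    print(line)
```

The same function can be applied to the lines of a saved dmesg capture; 
everything other than the matching ERROR lines passes through untouched.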

     J