* Xen 3.2.1-rc1: FATAL PAGE FAULT
@ 2008-04-03  4:34 Christopher S. Aker
  2008-04-03 14:04 ` Christopher S. Aker
  0 siblings, 1 reply; 10+ messages in thread
From: Christopher S. Aker @ 2008-04-03  4:34 UTC
  To: xen devel

Xen: 3.2.1-rc1 (I can get the exact changeset if needed)
domU: 2.6.16.33 PAE

(XEN) ----[ Xen-3.2.1-rc1  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    3
(XEN) RIP:    e008:[<ffff828c8013dee4>] put_page_type+0x17/0x107
(XEN) RFLAGS: 0000000000210282   CONTEXT: hypervisor
(XEN) rax: 00001c9f2d2abca8   rbx: ffff9f232d2abca8   rcx: 0000000080000000
(XEN) rdx: 000000b72dedde51   rsi: 00000000002f25fd   rdi: ffff9f232d2abca8
(XEN) rbp: ffff8300cee0fcb8   rsp: ffff8300cee0fc98   r8:  0000000000000000
(XEN) r9:  00000000deadbeef   r10: ffff828c801c5bf0   r11: 0000000000000000
(XEN) r12: 0000000000000000   r13: ffff9f232d2abca8   r14: ffff8300cfc84100
(XEN) r15: ffff8300cfc84118   cr0: 000000008005003b   cr4: 00000000000026b0
(XEN) cr3: 000000062ffd7000   cr2: ffff9f232d2abcc0
(XEN) ds: 007b   es: 007b   fs: 0000   gs: 0033   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff8300cee0fc98:
(XEN)    ffff8300cee0fcd8 ffff9f232d2abca8 0000000000000000 00000000002f25fd
(XEN)    ffff8300cee0fcd8 ffff828c8013b409 ffff8300cfc850f8 ffff8302f25fd000
(XEN)    ffff8300cee0fd08 ffff828c8013c06d ffff8300cfc84100 ffff8284075def88
(XEN)    0000000068000001 ffff8300cfc850f8 ffff8300cee0fd38 ffff828c8013de5a
(XEN)    0000000060000001 0000000068000000 ffff8284075def88 ffff8300cfc850f8
(XEN)    ffff8300cee0fd68 ffff828c8013df63 ffff8284075def88 ffff8284075def88
(XEN)    ffff8284075def88 ffff8300cfc84100 ffff8300cee0fdb8 ffff828c80131680
(XEN)    0000000088000000 0000000080000000 ffff8300cee0ff28 ffff8300cfc84100
(XEN)    ffff8300cfc84100 00000000b31fc868 0000000000000000 0000000000000000
(XEN)    ffff8300cee0fdd8 ffff828c80131a94 ffff8300cfc84100 0000000000000000
(XEN)    ffff8300cee0fe08 ffff828c80105638 ffff8300cee0fe18 ffff828c80114d70
(XEN)    00000000b31fc868 fffffffffffffff3 ffff8300cee0ff08 ffff828c8010479f
(XEN)    ffff8300cee0fe48 ffff8300cee34130 0000000000000003 0001b932a9ddc50a
(XEN)    0000000000200282 0000000000000000 0000000500000002 083ca594b7b50067
(XEN)    0832ab4c011fc898 b7dadc50b7b5d68c b7a733e400000001 00000001b79fccdc
(XEN)    080facafb31fc8c8 081361e008313e98 080797e7b76c1934 00000000b76c1950
(XEN)    b7da802c00000060 b761db6c00000000 0805946cb31fc8e8 b7da802cb761db6c
(XEN)    b7dab6a000000000 00000002b5d5451c a5dba1eea5dba1ee 0000001f00000000
(XEN)    ffff8300cee0fee8 ffff8300cee34100 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 00007cff311f00b7 ffff828c801bdd50
(XEN) Xen call trace:
(XEN)    [<ffff828c8013dee4>] put_page_type+0x17/0x107
(XEN)    [<ffff828c8013b409>] put_page_from_l3e+0x3f/0x4e
(XEN)    [<ffff828c8013c06d>] free_l3_table+0x78/0xc4
(XEN)    [<ffff828c8013de5a>] free_page_type+0x1d4/0x247
(XEN)    [<ffff828c8013df63>] put_page_type+0x96/0x107
(XEN)    [<ffff828c80131680>] relinquish_memory+0xce/0x262
(XEN)    [<ffff828c80131a94>] domain_relinquish_resources+0xd1/0x1b0
(XEN)    [<ffff828c80105638>] domain_kill+0x77/0x164
(XEN)    [<ffff828c8010479f>] do_domctl+0x4dd/0xc1e
(XEN)    [<ffff828c801bdd50>] compat_tracing_off+0xb/0x64
(XEN)
(XEN) Pagetable walk from ffff9f232d2abcc0:
(XEN)  L4[0x13e] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 3:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff9f232d2abcc0
(XEN) ****************************************
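
As a cross-check on the pagetable walk above: under standard x86-64
4-level paging, the L4 index is bits 47:39 of the linear address. A
minimal Python sketch (the helper name pt_indices is made up for
illustration, it is not a Xen function) that splits the faulting address
into its table indices:

```python
# Split a 64-bit linear address into x86-64 4-level paging indices.
# Each level uses 9 bits; the low 12 bits are the offset within a 4 KiB page.
def pt_indices(vaddr):
    return {
        "l4": (vaddr >> 39) & 0x1FF,
        "l3": (vaddr >> 30) & 0x1FF,
        "l2": (vaddr >> 21) & 0x1FF,
        "l1": (vaddr >> 12) & 0x1FF,
        "offset": vaddr & 0xFFF,
    }

# Faulting linear address copied from the panic message above.
FAULT_ADDR = 0xFFFF9F232D2ABCC0

if __name__ == "__main__":
    for level, index in pt_indices(FAULT_ADDR).items():
        print(f"{level}: {index:#x}")
```

The L4 index for ffff9f232d2abcc0 comes out as 0x13e, which is the slot
the walk reports as empty.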

* Re: Xen 3.2.1-rc1: FATAL PAGE FAULT
  2008-04-03  4:34 Xen 3.2.1-rc1: FATAL PAGE FAULT Christopher S. Aker
@ 2008-04-03 14:04 ` Christopher S. Aker
  2008-04-03 15:55   ` Keir Fraser
  0 siblings, 1 reply; 10+ messages in thread
From: Christopher S. Aker @ 2008-04-03 14:04 UTC
  To: xen devel; +Cc: Keir Fraser

Keir Fraser wrote:
 > On 3/4/08 14:27, "Christopher S. Aker" <caker@theshore.net> wrote:
 >
 >> I misspoke, dom0 was the 2.6.16.33.  domUs were a mix of 2.6.24.3
 >> pv_ops, and 2.6.18.8.  We have about a dozen of these boxes deployed
 >> with this version, each with 30-40 domains just doing their thing --
 >> nothing crazy.
 >
 > That's interesting. 2.6.24 is less tested than other Linux kernels, and
 > being pv_ops it is quite different. It's not unlikely to have corner-case
 > bugs that crash it or, worst case, tickle dormant problems in the hypervisor
 > itself.
 >
 >> Maybe the symbols would help just a little bit?  In any case, here are
 >> the files:
 >>
 >> http://theshore.net/~caker/xen/BUGfatal_page_fault/
 >
 > I will take a look. It might help narrow down the possibilities a bit.
 >
 >> I guess I'll set up a thrash test environment full of nothing but
 >> domains looping crashme and make -j kernel builds and the like.  Sounds
 >> like fun.
 >
 > Okay, is this a bug you've seen exactly once so far? That would be annoying!

So far just the one time.

We just took Xen out of (a three year) beta, and so we're gearing up for 
a large deployment and need to eliminate any potential host/hypervisor 
crashes.  I can deal with domain bugs, but having the whole box go down 
is painful.  Needless to say, I'm anxious to get this fixed, and will 
help in any way I can.  Can I provide anything else that you can think of?

In the meantime, we'll work up a thrash-xen box.

Thanks,
-Chris

* Re: Xen 3.2.1-rc1: FATAL PAGE FAULT
  2008-04-03 14:04 ` Christopher S. Aker
@ 2008-04-03 15:55   ` Keir Fraser
  2008-04-22 18:19     ` Xen 3.2.1-rc5: " Christopher S. Aker
  0 siblings, 1 reply; 10+ messages in thread
From: Keir Fraser @ 2008-04-03 15:55 UTC
  To: Christopher S. Aker, xen devel

On 3/4/08 15:04, "Christopher S. Aker" <caker@theshore.net> wrote:

>>> Maybe the symbols would help just a little bit?  In any case, here are
>>> the files:
>>> 
>>> http://theshore.net/~caker/xen/BUGfatal_page_fault/
>> 
>> I will take a look. It might help narrow down the possibilities a bit.
>> 

My analysis is that the hypervisor crashed because one of the entries in a
dying guest's third-level page directory has the present bit (bit 0) set,
yet the physical address mapped by that entry is 0xb72dedde51000. That is a
rather large and obviously bogus number! It causes us to access way off the
end of an array indexed by physical address, resulting in a fatal page
fault.
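
The arithmetic can be checked against the register dump in the original
report. A minimal Python sketch follows. Its assumptions are inferred
from the dump, not taken from the Xen source: that rdx holds the frame
number read from the bad entry, that the array indexed by frame number
starts at 0xffff828400000000, and that each element is 40 bytes. The
constant and function names are made up for illustration.

```python
# Register values copied from the first dump in this thread.
RDX_MFN = 0x000000B72DEDDE51     # rdx: frame number from the bad entry
RAX_OFFSET = 0x00001C9F2D2ABCA8  # rax
RBX_PAGE = 0xFFFF9F232D2ABCA8    # rbx (also rdi and r13)
CR2_FAULT = 0xFFFF9F232D2ABCC0   # cr2: faulting linear address

# Assumptions inferred from the dump, NOT taken from the Xen source.
FRAME_TABLE_BASE = 0xFFFF828400000000
PAGE_INFO_SIZE = 40
PAGE_SHIFT = 12  # 4 KiB pages

def entry_phys_addr(mfn):
    """Physical address named by a pagetable entry with this frame number."""
    return mfn << PAGE_SHIFT

def page_info_addr(mfn):
    """Address of the per-frame array element for this frame number."""
    return (FRAME_TABLE_BASE + mfn * PAGE_INFO_SIZE) & 0xFFFFFFFFFFFFFFFF

if __name__ == "__main__":
    print(f"physical address: {entry_phys_addr(RDX_MFN):#x}")
    print(f"array offset:     {RDX_MFN * PAGE_INFO_SIZE:#x}")
    print(f"element address:  {page_info_addr(RDX_MFN):#x}")
    print(f"cr2 - element:    {CR2_FAULT - page_info_addr(RDX_MFN):#x}")
```

Under those assumptions the frame number shifted by 12 gives
0xb72dedde51000, the offset matches rax, the element address matches rbx,
and cr2 lies 0x18 bytes past it, in address space the pagetable walk shows
as unmapped.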

Obviously the question is: Where did the bogus address come from?

That's going to be rather hard to answer without finding a more reliable
repro of the bug, and then adding some hypervisor tracing.

 -- Keir

* Re: Xen 3.2.1-rc5: FATAL PAGE FAULT
  2008-04-03 15:55   ` Keir Fraser
@ 2008-04-22 18:19     ` Christopher S. Aker
  2008-04-22 18:46       ` Keir Fraser
  0 siblings, 1 reply; 10+ messages in thread
From: Christopher S. Aker @ 2008-04-22 18:19 UTC
  To: Keir Fraser; +Cc: xen devel

Keir Fraser wrote:
> That's going to be rather hard to answer without finding a more reliable
> repro of the bug, and then adding some hypervisor tracing.

Here are two more Xen traces with this problem.  These always appear to 
occur after we're forced to destroy a domain.  The first trace is a 
DoubleDump<tm> and has something new in the second dump...

http://www.theshore.net/~caker/xen/build-1.11/

I still don't have a method to reproduce, but since we're hitting this 
with some frequency, would it be worth it to stick in some extra 
debugging now?


====== First trace ======

----[ Xen-3.2.1-rc5  x86_64  debug=y  Not tainted ]----
CPU:    1
RIP:    e008:[<ffff828c8013dee4>] put_page_type+0x17/0x107
RFLAGS: 0000000000210286   CONTEXT: hypervisor
rax: 00001da2f4162bf0   rbx: ffffa026f4162bf0   rcx: 0000000080000000
rdx: 000000bdac808de6   rsi: 0000000000402fe3   rdi: ffffa026f4162bf0
rbp: ffff8300cf13fbf8   rsp: ffff8300cf13fbd8   r8:  0000000000000000
r9:  00000000deadbeef   r10: ffff828c801c5bf0   r11: 0000000000000000
r12: 0000000000000000   r13: ffffa026f4162bf0   r14: 0000000000402fe3
r15: ffff82840a077b78   cr0: 000000008005003b   cr4: 00000000000026b0
cr3: 000000062ffdd000   cr2: ffffa026f4162c08
ds: 007b   es: 007b   fs: 0000   gs: 0033   ss: 0000   cs: e008
Xen stack trace from rsp=ffff8300cf13fbd8:
    0000000000000002 ffffa026f4162bf0 0000000000000000 ffff8300cee48100
    ffff8300cf13fc18 ffff828c8013b3bb 0000000000200202 ffff830402fe3000
    ffff8300cf13fc58 ffff828c8013bfcd 00000000cee48100 ffff8300cee48100
    ffff82840a077b78 000000004c000001 ffff8300cee48100 ffff8300cee48118
    ffff8300cf13fc88 ffff828c8013de4a 0000000044000001 000000004c000000
    ffff82840a077b78 ffff8300cee48100 ffff8300cf13fcb8 ffff828c8013df63
    00007cff30ec0337 ffff82840a077b78 0000000000000003 00000000004011a4
    ffff8300cf13fcd8 ffff828c8013b409 ffff8300cf13fd68 ffff8304011a4018
    ffff8300cf13fd08 ffff828c8013c06d ffff8300cee48100 ffff82840a02c1a0
    0000000068000001 ffff8300cee490f8 ffff8300cf13fd38 ffff828c8013de5a
    0000000060000001 0000000068000000 ffff82840a02c1a0 ffff8300cee490f8
    ffff8300cf13fd68 ffff828c8013df63 ffff82840a02c1a0 ffff82840a02c1a0
    ffff82840a02c1a0 ffff8300cee48100 ffff8300cf13fdb8 ffff828c80131680
    0000000088000000 0000000080000000 ffff8300cf13ff28 ffff8300cee48100
    ffff8300cee48100 00000000b4dfc508 0000000000000000 0000000000000000
    ffff8300cf13fdd8 ffff828c80131a94 ffff8300cee48100 0000000000000000
    ffff8300cf13fe08 ffff828c80105638 ffff82840f448b58 ffff8300cf13fe28
    00000000b4dfc508 fffffffffffffff3 ffff8300cf13ff08 ffff828c8010479f
    00000000000000fb ffff8300cee3a130 ffff8300cf13fe68 ffff828c8011c746
    0000000000200282 ffff8300ceefe118 0000000500000002 083010acb7ab000a
Xen call trace:
    [<ffff828c8013dee4>] put_page_type+0x17/0x107
    [<ffff828c8013b3bb>] put_page_from_l2e+0x3f/0x4e
    [<ffff828c8013bfcd>] free_l2_table+0xa6/0xce
    [<ffff828c8013de4a>] free_page_type+0x1c4/0x247
    [<ffff828c8013df63>] put_page_type+0x96/0x107
    [<ffff828c8013b409>] put_page_from_l3e+0x3f/0x4e
    [<ffff828c8013c06d>] free_l3_table+0x78/0xc4
    [<ffff828c8013de5a>] free_page_type+0x1d4/0x247
    [<ffff828c8013df63>] put_page_type+0x96/0x107
    [<ffff828c80131680>] relinquish_memory+0xce/0x262
    [<ffff828c80131a94>] domain_relinquish_resources+0xd1/0x1b0
    [<ffff828c80105638>] domain_kill+0x77/0x164
    [<ffff828c8010479f>] do_domctl+0x4dd/0xc1e
    [<ffff828c801bdd50>] compat_tracing_off+0xb/0x64

Pagetable walk from ffffa026f4162c08:
  L4[0x140] = 0000000000000000 ffffffffffffffff

****************************************
Panic on CPU 1:
FATAL PAGE FAULT
[error_code=0000]
Faulting linear address: ffffa026f4162c08
****************************************

Reboot in five seconds...

...3 seconds later, this occurred...

Assertion '__cpus_subset(&(cpumask), &(cpu_online_map), 32)' failed at smp.c:84
----[ Xen-3.2.1-rc5  x86_64  debug=y  Not tainted ]----
CPU:    0
RIP:    e008:[<ffff828c80145c68>] send_IPI_mask_flat+0x29/0x9c
RFLAGS: 0000000000010002   CONTEXT: hypervisor
rax: 00000000fffffffe   rbx: ffff8300cee3c100   rcx: 0000000000000003
rdx: 0000000000000040   rsi: 00000000000000fc   rdi: 0000000000000004
rbp: ffff828c80237be8   rsp: ffff828c80237bd0   r8:  ffff828c8024c780
r9:  0000000000000002   r10: 00000000deadbeef   r11: 0000000000000000
r12: 0000000000000004   r13: 00000000000000fc   r14: 0000000000000010
r15: 00001485db7a5091   cr0: 000000008005003b   cr4: 00000000000026b0
cr3: 00000003ff15a000   cr2: 00000000e3015078
ds: 007b   es: 007b   fs: 00d8   gs: 0000   ss: 0000   cs: e008
Xen stack trace from rsp=ffff828c80237bd0:
    ffff8300cee3c100 0000000000000086 0000000000000000 ffff828c80237c08
    ffff828c8014601a ffff8300cee30f00 0000000000000004 ffff828c80237c38
    ffff828c80114da0 0000000000000004 ffff828c80137fe0 0000000000000004
    ffff828c8025951c ffff828c80237c68 ffff828c80119b18 ffff828c80237c98
    ffff828c80137ac2 ffff8300cee3c100 ffff8300cfdd4100 ffff828c80237c98
    ffff828c80107409 00000000c0621300 ffff8300cfdd4100 ffff8300cee30f00
    0000000000000000 ffff828c80237ca8 ffff828c801075c9 ffff828c80237cd8
    ffff828c80137fe0 ffff828c80259500 ffff828c8025951c 0000000000000098
    ffff828c80237d38 ffff828c80237d28 ffff828c80137ac2 0000000000000082
    0000000000000000 ffff828c80237d18 0000000000000009 00000000ffffffff
    ffff828c801ebb60 ffff828c8020e100 00001485db7a5091 00007d737fdc82a7
    ffff828c801336e6 00001485db7a5091 ffff828c8020e100 ffff828c801ebb60
    00000000ffffffff ffff828c80237de8 0000000000000009 0000000000000000
    00000000deadbeef 0000000000000000 0000000000000000 000000007d9b040e
    000000007d8a4358 000000000000290c 00000000001e8480 00000000000003e8
    0000009800000000 ffff828c8012ac48 000000000000e008 0000000000000216
    ffff828c80237de8 0000000000000000 00001485db7a5091 ffff828c80237e08
    ffff828c80146257 ffff828c80237f28 ffff828c8020e534 ffff828c80237e28
    ffff828c80145b9a ffff828c80237f28 ffff828c8020e534 ffff828c80237e38
    ffff828c80146312 00007d737fdc8197 ffff828c801347a0 00001485db7a5091
Xen call trace:
    [<ffff828c80145c68>] send_IPI_mask_flat+0x29/0x9c
    [<ffff828c8014601a>] smp_send_event_check_mask+0x3e/0x40
    [<ffff828c80114da0>] csched_vcpu_wake+0x242/0x259
    [<ffff828c80119b18>] vcpu_wake+0x12d/0x248
    [<ffff828c80107409>] evtchn_set_pending+0xe5/0x15c
    [<ffff828c801075c9>] send_guest_pirq+0x61/0x63
    [<ffff828c80137fe0>] __do_IRQ_guest+0x19c/0x1b2
    [<ffff828c80137ac2>] do_IRQ+0x5a/0x1a7
    [<ffff828c801336e6>] common_interrupt+0x26/0x30
    [<ffff828c8012ac48>] __udelay+0x30/0x48
    [<ffff828c80146257>] smp_send_stop+0x39/0x67
    [<ffff828c80145b9a>] machine_restart+0x4f/0xc5
    [<ffff828c80146312>] smp_call_function_interrupt+0x79/0xa7
    [<ffff828c801347a0>] call_function_interrupt+0x30/0x40
    [<ffff828c8012c73b>] default_idle+0x2f/0x34
    [<ffff828c8012c7ff>] idle_loop+0x70/0x77


****************************************
Panic on CPU 0:
Assertion '__cpus_subset(&(cpumask), &(cpu_online_map), 32)' failed at smp.c:84
****************************************

Reboot in five seconds...


====== Second trace ======

----[ Xen-3.2.1-rc5  x86_64  debug=y  Not tainted ]----
CPU:    0
RIP:    e008:[<ffff828c8013dee4>] put_page_type+0x17/0x107
RFLAGS: 0000000000210286   CONTEXT: hypervisor
rax: 00000a51169fd050   rbx: ffff8cd5169fd050   rcx: 0000000080000000
rdx: 0000004206f73202   rsi: 00000000004041e1   rdi: ffff8cd5169fd050
rbp: ffff828c80237bf8   rsp: ffff828c80237bd8   r8:  0000000000000000
r9:  00000000deadbeef   r10: ffff828c801c5bf0   r11: 0000000000000000
r12: 0000000000000000   r13: ffff8cd5169fd050   r14: 00000000004041e1
r15: ffff82840a0a4b28   cr0: 000000008005003b   cr4: 00000000000026b0
cr3: 000000062ffd9000   cr2: ffff8cd5169fd068
ds: 007b   es: 007b   fs: 0000   gs: 0033   ss: 0000   cs: e008
Xen stack trace from rsp=ffff828c80237bd8:
    ffff828409df5d01 ffff8cd5169fd050 0000000000000000 ffff8300ceea0100
    ffff828c80237c18 ffff828c8013b3bb 0000000400000004 ffff8304041e1000
    ffff828c80237c58 ffff828c8013bfcd 00000003f2f24027 ffff8300ceea0100
    ffff82840a0a4b28 0000000048000001 ffff8300ceea0100 ffff8300ceea0118
    ffff828c80237c88 ffff828c8013de4a 0000000040000001 0000000048000000
    ffff82840a0a4b28 ffff8300ceea0100 ffff828c80237cb8 ffff828c8013df63
    0000000000000000 ffff82840a0a4b28 0000000000000000 0000000000402dd4
    ffff828c80237cd8 ffff828c8013b409 ffff8300ceea0100 ffff830402dd4000
    ffff828c80237d08 ffff828c8013c06d ffff8300ceea0100 ffff82840a072920
    0000000068000001 ffff8300ceea10f8 ffff828c80237d38 ffff828c8013de5a
    0000000060000001 0000000068000000 ffff82840a072920 ffff8300ceea10f8
    ffff828c80237d68 ffff828c8013df63 ffff82840a072920 ffff82840a072920
    ffff82840a072920 ffff8300ceea0100 ffff828c80237db8 ffff828c80131680
    0000000088000000 0000000080000000 ffff828c80237f28 ffff8300ceea0100
    ffff8300ceea0100 00000000b2cf9868 0000000000000000 0000000000000000
    ffff828c80237dd8 ffff828c80131a94 ffff8300ceea0100 0000000000000000
    ffff828c80237e08 ffff828c80105638 ffff828c80237e18 ffff828c80114da0
    00000000b2cf9868 fffffffffffffff3 ffff828c80237f08 ffff828c8010479f
    ffff828c80237e48 ffff8300cee36130 0000000000000000 000078cdfb20f27f
    0000000000200282 0000000000000000 0000000500000002 081d66ecb7af0010
Xen call trace:
    [<ffff828c8013dee4>] put_page_type+0x17/0x107
    [<ffff828c8013b3bb>] put_page_from_l2e+0x3f/0x4e
    [<ffff828c8013bfcd>] free_l2_table+0xa6/0xce
    [<ffff828c8013de4a>] free_page_type+0x1c4/0x247
    [<ffff828c8013df63>] put_page_type+0x96/0x107
    [<ffff828c8013b409>] put_page_from_l3e+0x3f/0x4e
    [<ffff828c8013c06d>] free_l3_table+0x78/0xc4
    [<ffff828c8013de5a>] free_page_type+0x1d4/0x247
    [<ffff828c8013df63>] put_page_type+0x96/0x107
    [<ffff828c80131680>] relinquish_memory+0xce/0x262
    [<ffff828c80131a94>] domain_relinquish_resources+0xd1/0x1b0
    [<ffff828c80105638>] domain_kill+0x77/0x164
    [<ffff828c8010479f>] do_domctl+0x4dd/0xc1e
    [<ffff828c801bdd50>] compat_tracing_off+0xb/0x64

Pagetable walk from ffff8cd5169fd068:
  L4[0x119] = 0000000000000000 ffffffffffffffff

****************************************
Panic on CPU 0:
FATAL PAGE FAULT
[error_code=0000]
Faulting linear address: ffff8cd5169fd068
****************************************

Reboot in five seconds...

-Chris

* Re: Xen 3.2.1-rc5: FATAL PAGE FAULT
  2008-04-22 18:19     ` Xen 3.2.1-rc5: " Christopher S. Aker
@ 2008-04-22 18:46       ` Keir Fraser
  2008-04-22 19:39         ` Christopher S. Aker
  0 siblings, 1 reply; 10+ messages in thread
From: Keir Fraser @ 2008-04-22 18:46 UTC
  To: Christopher S. Aker; +Cc: xen devel

On 22/4/08 19:19, "Christopher S. Aker" <caker@theshore.net> wrote:

> Here are two more Xen traces with this problem.  These always appear to
> occur after we're forced to destroy a domain.  The first trace is a
> DoubleDump<tm> and has something new in the second dump...
> 
> http://www.theshore.net/~caker/xen/build-1.11/
> 
> I still don't have a method to reproduce, but since we're hitting this
> with some frequency, would it be worth it to stick in some extra
> debugging now?

The second crash is just some overzealous asserting. Easily fixed but also
not very interesting, unfortunately.

The two main backtraces are exactly the same bug as you saw last time.
Except in this case you have bogus nonsense in a pair of L2 pagetable
entries, whereas last time the garbage was in an L3 entry.

My best guess just now, seeing as no one else has reported ever seeing this,
is that maybe you have a bad driver or hardware corrupting memory? Obviously
that's a bit of a stab in the dark though.

Have you seen this particular type of crash on multiple different machines?
If so, are they different types of machine?

 -- Keir

* Re: Xen 3.2.1-rc5: FATAL PAGE FAULT
  2008-04-22 18:46       ` Keir Fraser
@ 2008-04-22 19:39         ` Christopher S. Aker
  2008-04-22 20:21           ` Keir Fraser
  0 siblings, 1 reply; 10+ messages in thread
From: Christopher S. Aker @ 2008-04-22 19:39 UTC
  To: Keir Fraser; +Cc: xen devel

Keir Fraser wrote:
> The second crash is just some overzealous asserting. Easily fixed but also
> not very interesting, unfortunately.
> 
> The two main backtraces are exactly the same bug as you saw last time.
> Except in this case you have bogus nonsense in a pair of L2 pagetable
> entries, whereas last time the garbage was in an L3 entry.
> 
> My best guess just now, seeing as no one else has reported ever seeing this,
> is that maybe you have a bad driver or hardware corrupting memory? Obviously
> that's a bit of a stab in the dark though.
> 
> Have you seen this particular type of crash on multiple different machines?
> If so, are they different types of machine?

Two machines thus far, both are of identical software and hardware 
configurations.

Now that it looks like the 3ware issues have been corrected in post-Xen 
3.1 dom0, I'll update our boxes from 2.6.16.x to 2.6.18.8 and hope for 
the best.

Thanks for your help so far.

-Chris

* Re: Xen 3.2.1-rc5: FATAL PAGE FAULT
  2008-04-22 19:39         ` Christopher S. Aker
@ 2008-04-22 20:21           ` Keir Fraser
  2008-04-28 14:02             ` Christopher S. Aker
  0 siblings, 1 reply; 10+ messages in thread
From: Keir Fraser @ 2008-04-22 20:21 UTC
  To: Christopher S. Aker; +Cc: xen devel

On 22/4/08 20:39, "Christopher S. Aker" <caker@theshore.net> wrote:

>> My best guess just now, seeing as no one else has reported ever seeing this,
>> is that maybe you have a bad driver or hardware corrupting memory? Obviously
>> that's a bit of a stab in the dark though.
>> 
>> Have you seen this particular type of crash on multiple different machines?
>> If so, are they different types of machine?
> 
> Two machines thus far, both are of identical software and hardware
> configurations.

Have you been running this type of workload on a variety of hardware, or are
you limited in the range of types of hardware that you're testing on? This
might indicate whether it is significant that you have only seen the crash
on a single hardware type.

 -- Keir

* Re: Xen 3.2.1-rc5: FATAL PAGE FAULT
  2008-04-22 20:21           ` Keir Fraser
@ 2008-04-28 14:02             ` Christopher S. Aker
  2008-04-28 14:44               ` Keir Fraser
  0 siblings, 1 reply; 10+ messages in thread
From: Christopher S. Aker @ 2008-04-28 14:02 UTC
  To: Keir Fraser; +Cc: xen devel

Keir Fraser wrote:
> On 22/4/08 20:39, "Christopher S. Aker" <caker@theshore.net> wrote:
> 
>>> My best guess just now, seeing as no one else has reported ever seeing this,
>>> is that maybe you have a bad driver or hardware corrupting memory? Obviously
>>> that's a bit of a stab in the dark though.
>>>
>>> Have you seen this particular type of crash on multiple different machines?
>>> If so, are they different types of machine?
>> Two machines thus far, both are of identical software and hardware
>> configurations.
> 
> Have you been running this type of workload on a variety of hardware, or are
> you limited in the range of types of hardware that you're testing on? This
> might indicate whether it is significant that you have only seen the crash
> on a single hardware type.

Make that three machines.  They're all of the same config.  This 
identical hardware config runs fine under non-Xen.  It also only occurs 
when a domain is being destroyed, so I wouldn't suspect this is a driver 
issue or memory corruption given the pattern.  Xen is most suspect, in 
my mind.

Will you provide me with some debugging code that'll make these 
occurrences more useful in tracking down the problem the next time it 
triggers?

(XEN) Pagetable walk from 00000000c16e3f30:
(XEN)  L4[0x000] = 00000002bfe8d027 00000000000258e3
(XEN)  L3[0x003] = 646c696843206120 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 84 (vcpu#2) crashed on cpu#1:
(XEN) ----[ Xen-3.2.1-rc1  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    1
(XEN) RIP:    0061:[<00000000c0101347>]
(XEN) RFLAGS: 0000000000010246   CONTEXT: guest
(XEN) rax: 0000000000000000   rbx: 00000000deadbeef   rcx: 00000000deadbeef
(XEN) rdx: 00000000deadbeef   rsi: 00000000deadbeef   rdi: 00000000c7006030
(XEN) rbp: 00000000c16e3fac   rsp: 00000000c16e3f38   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000026b0
(XEN) cr3: 000000060f4c8000   cr2: 00000000c0101347
(XEN) ds: 007b   es: 007b   fs: 0000   gs: 0000   ss: 0069   cs: 0061
(XEN) Guest stack trace from esp=c16e3f38:
(XEN)  Fault while accessing guest memory.
(XEN) ----[ Xen-3.2.1-rc1  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    5
(XEN) RIP:    e008:[<ffff828c8013dee4>] put_page_type+0x17/0x107
(XEN) RFLAGS: 0000000000210282   CONTEXT: hypervisor
(XEN) rax: 000006162f512f98   rbx: ffff889a2f512f98   rcx: 6765746143206568
(XEN) rdx: 00000026f4620797   rsi: 00000000002bfe8d   rdi: ffff889a2f512f98
(XEN) rbp: ffff8300cfde7cb8   rsp: ffff8300cfde7c98   r8:  0000000000000000
(XEN) r9:  00000000deadbeef   r10: ffff828c801c5bf0   r11: 0000000000000000
(XEN) r12: 0000000000000001   r13: ffff889a2f512f98   r14: ffff8300cee88100
(XEN) r15: ffff8300cee88118   cr0: 000000008005003b   cr4: 00000000000026b0
(XEN) cr3: 000000062ffdf000   cr2: ffff889a2f512fb0
(XEN) ds: 007b   es: 007b   fs: 0000   gs: 0033   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff8300cfde7c98:
(XEN)    ffff8300cfde7ca8 ffff889a2f512f98 0000000000000001 00000000002bfe8d
(XEN)    ffff8300cfde7cd8 ffff828c8013b409 ffff8300cee88100 ffff8302bfe8d008
(XEN)    ffff8300cfde7d08 ffff828c8013c06d ffff8300cee88100 ffff828406dfc608
(XEN)    0000000068000001 ffff8300cee890f8 ffff8300cfde7d38 ffff828c8013de5a
(XEN)    0000000060000001 0000000068000000 ffff828406dfc608 ffff8300cee890f8
(XEN)    ffff8300cfde7d68 ffff828c8013df63 ffff828406dfc608 ffff828406dfc608
(XEN)    ffff828406dfc608 ffff8300cee88100 ffff8300cfde7db8 ffff828c80131680
(XEN)    0000000088000000 0000000080000000 ffff8300cfde7f28 ffff8300cee88100
(XEN)    ffff8300cee88100 00000000b4dfb508 0000000000000000 0000000000000000
(XEN)    ffff8300cfde7dd8 ffff828c80131a94 ffff8300cee88100 0000000000000000
(XEN)    ffff8300cfde7e08 ffff828c80105638 ffff8300cfde7e08 ffff828c8014601a
(XEN)    00000000b4dfb508 fffffffffffffff3 ffff8300cfde7f08 ffff828c8010479f
(XEN)    0000000000000001 0000000000000000 0000000000000001 0000000000000000
(XEN)    ffff8300cfde7e68 0000000000200286 0000000500000002 082ebba4b7b80054
(XEN)    0836d2a401dfb538 b7ddfc50b7b8f68c b7aa53e400000001 00000001b7a2ecdc
(XEN)    080facafb4dfb568 081361e0082f17c0 080797e7b775bf0c 00000000b775bf28
(XEN)    b7dda02c00000060 b76f084c00000000 0805946cb4dfb588 b7dda02cb76f084c
(XEN)    b7ddd6a000000000 00000002b765eeac a5dba1eea5dba1ee 0000001f00000000
(XEN)    0000000000000010 ffff8300cee3c100 0000000000000000 0000000000000000
(XEN)    0000000000000000 0000000000000000 00007cff302180b7 ffff828c801bdd50
(XEN) Xen call trace:
(XEN)    [<ffff828c8013dee4>] put_page_type+0x17/0x107
(XEN)    [<ffff828c8013b409>] put_page_from_l3e+0x3f/0x4e
(XEN)    [<ffff828c8013c06d>] free_l3_table+0x78/0xc4
(XEN)    [<ffff828c8013de5a>] free_page_type+0x1d4/0x247
(XEN)    [<ffff828c8013df63>] put_page_type+0x96/0x107
(XEN)    [<ffff828c80131680>] relinquish_memory+0xce/0x262
(XEN)    [<ffff828c80131a94>] domain_relinquish_resources+0xd1/0x1b0
(XEN)    [<ffff828c80105638>] domain_kill+0x77/0x164
(XEN)    [<ffff828c8010479f>] do_domctl+0x4dd/0xc1e
(XEN)    [<ffff828c801bdd50>] compat_tracing_off+0xb/0x64
(XEN)
(XEN) Pagetable walk from ffff889a2f512fb0:
(XEN)  L4[0x111] = 0000000000000000 ffffffffffffffff
(XEN)
(XEN) ****************************************
(XEN) Panic on CPU 5:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff889a2f512fb0
(XEN) ****************************************
(XEN)
(XEN) Reboot in five seconds...
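
One observation on the dump above, offered as a sketch rather than a
diagnosis: the L3 entry from the guest pagetable walk (646c696843206120)
and the rcx value in the hypervisor dump (6765746143206568) both decode
to printable ASCII when read as little-endian bytes. A few lines of
Python show this (the helper name qword_as_text is made up for
illustration):

```python
import struct

def qword_as_text(value):
    """Render a 64-bit value as its little-endian bytes, '.' for non-printables."""
    raw = struct.pack("<Q", value)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)

# Values copied from the dump above.
L3_ENTRY = 0x646C696843206120  # L3[0x003] in the guest pagetable walk
RCX = 0x6765746143206568       # rcx in the hypervisor register dump

if __name__ == "__main__":
    print(repr(qword_as_text(L3_ENTRY)))
    print(repr(qword_as_text(RCX)))
```

This says nothing about where the bytes came from, only that these two
values look like text rather than frame addresses or flags.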

Thanks,
-Chris

* Re: Xen 3.2.1-rc5: FATAL PAGE FAULT
  2008-04-28 14:02             ` Christopher S. Aker
@ 2008-04-28 14:44               ` Keir Fraser
  2008-04-28 15:00                 ` Christopher S. Aker
  0 siblings, 1 reply; 10+ messages in thread
From: Keir Fraser @ 2008-04-28 14:44 UTC
  To: Christopher S. Aker; +Cc: xen devel

On 28/4/08 15:02, "Christopher S. Aker" <caker@theshore.net> wrote:

> Make that three machines.  They're all of the same config.  This
> identical hardware config runs fine under non-Xen.  It also only occurs
> when a domain is being destroyed, so I wouldn't suspect this is a driver
> issue or memory corruption given the pattern.  Xen is most suspect, in
> my mind.
> 
> Will you provide me with some debugging code that'll make these
> occurrences more useful in tracking down the problem the next time it
> triggers?

I suggest you try repro'ing on a slightly different hardware configuration.
For example, a different storage controller. Did you repro with a 2.6.18
dom0 yet?

 -- Keir

* Re: Xen 3.2.1-rc5: FATAL PAGE FAULT
  2008-04-28 14:44               ` Keir Fraser
@ 2008-04-28 15:00                 ` Christopher S. Aker
  0 siblings, 0 replies; 10+ messages in thread
From: Christopher S. Aker @ 2008-04-28 15:00 UTC
  To: Keir Fraser; +Cc: xen devel

Keir Fraser wrote:
> On 28/4/08 15:02, "Christopher S. Aker" <caker@theshore.net> wrote:
>> Will you provide me with some debugging code that'll make these
>> occurrences more useful in tracking down the problem the next time it
>> triggers?
> 
> I suggest you try repro'ing on a slightly different hardware configuration.
> For example, a different storage controller. Did you repro with a 2.6.18
> dom0 yet?

All of our machines are using 3ware RAID cards, so trying this on 
alternate hardware isn't an option.

We haven't hit this on 2.6.18 dom0 yet.  Newly deployed machines and 
boxes that crash are being updated to 2.6.18 dom0.  I'll keep you posted :)

Thanks,
-Chris
