* 3.1/2 live migration panic
@ 2008-01-16  1:18 John Levon
  2008-01-16  1:47 ` Ian Pratt
  2008-01-16 20:37 ` John Levon
  0 siblings, 2 replies; 26+ messages in thread
From: John Levon @ 2008-01-16  1:18 UTC (permalink / raw)
  To: xen-devel


Is this familiar to anybody? I can reproduce this during live migration
and heavy disk usage in both 3.1 and 3.2 on 64-bit.

regards
john

Xen panic[dom=0xffff8300e2e86100/vcpu=0xffff8300e2edc100]: FATAL PAGE FAULT
[error_code=0000]
Faulting linear address: ffff81c0ffc07c48

        rdi: ffff828c801efdf0 rsi: ffff8300e2ef7a18 rdx: ffff8300e2ef7a48
        rcx:                8  r8:        800000000  r9: ffff8300e3de8100
        rax: ffff8300e2ef7b08 rbx: ffff8300e2edc100 rbp: ffff8300e2ef7af8
        r10: ffff828c80203468 r11: ffff8300e2ef7a88 r12:              282
        r13:       3000000008 r14: ffff828c8013577a r15:        3e2ef79a8
        fsb:                0 gsb: ffffff00ba9dc580  ds:               4b
         es:               4b  fs:                0  gs:              1c3
         cs:             e008 rfl:              282 rsp: ffff8300e2ef7a00
        rip: ffff828c8015b2eb:  ss:                0
        cr0: 8005003b  cr2: ffff81c0ffc07c48  cr3: 1cc4c1000  cr4:      6f0
Xen panic[dom=0xffff8300e2e86100/vcpu=0xffff8300e2edc100]: FATAL PAGE FAULT
[error_code=0000]
Faulting linear address: ffff81c0ffc07c48


ffff8300e2ef7af8 xpv:do_page_fault+13d
ffff8300e2ef7b38 xpv:handle_exception+4b
ffff8300e2ef7b68 0xffff8300e2e86100 (in Xen)
ffff8300e2ef7c58 xpv:sh_page_fault__shadow_4_guest_4+598
ffff8300e2ef7e58 xpv:paging_fault+3c
ffff8300e2ef7e88 xpv:fixup_page_fault+22b
ffff8300e2ef7ed8 xpv:do_page_fault+40
ffff8300e2ef7f18 xpv:handle_exception+4b


* RE: 3.1/2 live migration panic
  2008-01-16  1:18 3.1/2 live migration panic John Levon
@ 2008-01-16  1:47 ` Ian Pratt
  2008-01-16  1:51   ` John Levon
  2008-01-16 19:45   ` John Levon
  2008-01-16 20:37 ` John Levon
  1 sibling, 2 replies; 26+ messages in thread
From: Ian Pratt @ 2008-01-16  1:47 UTC (permalink / raw)
  To: John Levon, xen-devel; +Cc: Ian Pratt

> Is this familiar to anybody? I can reproduce this during live migration
> and heavy disk usage in both 3.1 and 3.2 on 64-bit.

What guest OS? How many VCPUs?

Ian
 


* Re: 3.1/2 live migration panic
  2008-01-16  1:47 ` Ian Pratt
@ 2008-01-16  1:51   ` John Levon
  2008-01-16 19:45   ` John Levon
  1 sibling, 0 replies; 26+ messages in thread
From: John Levon @ 2008-01-16  1:51 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel

On Wed, Jan 16, 2008 at 01:47:00AM -0000, Ian Pratt wrote:

> > Is this familiar to anybody? I can reproduce this during live migration
> > and heavy disk usage in both 3.1 and 3.2 on 64-bit.
> 
> What guest OS? How many VCPUs?

Solaris domU+dom0. Both have 4 VCPUs. I can test a Linux domU and try to
reproduce: I'll do that tomorrow.

regards
john


* Re: 3.1/2 live migration panic
  2008-01-16  1:47 ` Ian Pratt
  2008-01-16  1:51   ` John Levon
@ 2008-01-16 19:45   ` John Levon
  1 sibling, 0 replies; 26+ messages in thread
From: John Levon @ 2008-01-16 19:45 UTC (permalink / raw)
  To: Ian Pratt; +Cc: xen-devel

On Wed, Jan 16, 2008 at 01:47:00AM -0000, Ian Pratt wrote:

> > Is this familiar to anybody? I can reproduce this during live migration
> > and heavy disk usage in both 3.1 and 3.2 on 64-bit.
> 
> What guest OS? How many VCPUs?

I can reproduce with:

Solaris domU with > 1 VCPU

I can't reproduce with:

Solaris domU with 1 VCPU
Linux domU with 4 VCPUs

Seems like there's some unusual race with SMP Solaris domUs.

regards
john


* Re: 3.1/2 live migration panic
  2008-01-16  1:18 3.1/2 live migration panic John Levon
  2008-01-16  1:47 ` Ian Pratt
@ 2008-01-16 20:37 ` John Levon
  2008-01-16 21:43   ` Keir Fraser
  1 sibling, 1 reply; 26+ messages in thread
From: John Levon @ 2008-01-16 20:37 UTC (permalink / raw)
  To: xen-devel

On Wed, Jan 16, 2008 at 01:18:53AM +0000, John Levon wrote:

> ffff8300e2ef7c58 xpv:sh_page_fault__shadow_4_guest_4+598

Looking at what I can of the disasm, this looks like we're here:

2817     /* Make sure there is enough free shadow memory to build a chain of
2818      * shadow tables: one SHADOW_MAX_ORDER chunk will always be enough
2819      * to allocate all we need.  (We never allocate a top-level shadow
2820      * on this path, only a 32b l1, pae l2+1 or 64b l3+2+1) */
2821     shadow_prealloc(d, SHADOW_MAX_ORDER);
2822 
2823     /* Acquire the shadow.  This must happen before we figure out the rights 
2824      * for the shadow entry, since we might promote a page here. */
2825     ptr_sl1e = shadow_get_and_create_l1e(v, &gw, &sl1mfn, ft);
  >----<

So we're taking a fault somewhere in shadow_get_and_create_l1e(). Unfortunately the
exact point doesn't look easy to find, since the stack trace makes no sense:

ffff8300e2ef7b38 xpv`do_page_fault+0x13d(ffff8300e2ef7b48)
ffff8300e2ef7b68 0xffff828c801d354b()
ffff8300e2ef7c58 0xffff8300e2e86100()
ffff8300e2ef7e58 xpv`sh_page_fault__shadow_4_guest_4+0x598()

Looking through the stack by hand, I do see:

> ffff828c8014e5f2=p
                xpv`guest_get_eff_l1e+0xb9

but of course this might just be stack junk.

regards
john


* Re: 3.1/2 live migration panic
  2008-01-16 20:37 ` John Levon
@ 2008-01-16 21:43   ` Keir Fraser
  2008-01-16 21:52     ` John Levon
  2008-01-16 22:37     ` John Levon
  0 siblings, 2 replies; 26+ messages in thread
From: Keir Fraser @ 2008-01-16 21:43 UTC (permalink / raw)
  To: John Levon, xen-devel

If you have a debug build of Xen then the backtrace should be trustworthy.
Are there addresses in the backtrace that don't look to be within Xen text?
Your backtraces don't appear to be in the usual Xen format, so I'm not
entirely sure what I'm looking at.

 -- Keir



* Re: 3.1/2 live migration panic
  2008-01-16 21:43   ` Keir Fraser
@ 2008-01-16 21:52     ` John Levon
  2008-01-16 22:37     ` John Levon
  1 sibling, 0 replies; 26+ messages in thread
From: John Levon @ 2008-01-16 21:52 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

On Wed, Jan 16, 2008 at 09:43:41PM +0000, Keir Fraser wrote:

> If you have a debug build of Xen then the backtrace should be trustworthy.
> Are there addresses in the backtrace that don't look to be within Xen text?
> Your backtraces don't appear to be in the usual Xen format, so I'm not
> entirely sure what I'm looking at.

I'll try turning off our panic support to see if the xen-reported stack
is any better.

regards
john


* Re: 3.1/2 live migration panic
  2008-01-16 21:43   ` Keir Fraser
  2008-01-16 21:52     ` John Levon
@ 2008-01-16 22:37     ` John Levon
  2008-01-16 23:01       ` Keir Fraser
  1 sibling, 1 reply; 26+ messages in thread
From: John Levon @ 2008-01-16 22:37 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel

On Wed, Jan 16, 2008 at 09:43:41PM +0000, Keir Fraser wrote:

> If you have a debug build of Xen then the backtrace should be trustworthy.
> Are there addresses in the backtrace that don't look to be within Xen text?

Here's what I got without the panic patch (sigh):

(XEN) sh error: sh_page_fault__shadow_4_guest_4(): Recursive shadow fault: lock was taken by sh_page_fault__shadow_4_guest_4
(XEN) ----[ Xen-3.1.2  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    3
(XEN) RIP:    e008:[<ffff828c80168822>] shadow_set_l1e+0x32/0x1b0
(XEN) RFLAGS: 0000000000010282   CONTEXT: hypervisor
(XEN) rax: 0000000000000000   rbx: 800000012521c067   rcx: 0000000000132b04
(XEN) rdx: 800000012521c067   rsi: 00000000000000b1   rdi: ffff8300e2ed2080
(XEN) rbp: ffff8300e2e0fc08   rsp: ffff8300e2e0fbc8   r8:  0000000000000006
(XEN) r9:  0000000000000006   r10: 0000000132b05118   r11: 0000000132b07ff0
(XEN) r12: ffff8300e2ed2080   r13: 00000000000000b1   r14: 0000000000132b04
(XEN) r15: 0000000000000001   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 0000000132b07000   cr2: 00000000000000b1
(XEN) ds: 004b   es: 004b   fs: 0000   gs: 01c3   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff8300e2e0fbc8:
(XEN)    ffff8300e2e0fc08 0000000080167b02 800000012521c067
(XEN)    ffff8300e2e0ff28 ffff8300e2ed2080 ffff8300e2fb6080
(XEN)    ffff828c80221b20 0000000000000001 ffff8300e2e0fe18
(XEN)    ffff828c8016af61 0000000000000000 ffffff0003fd3ac0
(XEN)    0000000000000000 ffff828c80221b20 ffff8300e2e0fd98
(XEN)    ffff828c8010d757 ffff830184aeb000 0000000000000008
(XEN)    ffff8300e2fb6080 ffff828c801c52b8 0000000000132b07
(XEN)    ffff81c0ffc00118 00000000000000b1 ffff8300e2e0fcf0
(XEN)    0000000000000008 000000000012521c 00000006e2e0fd68
(XEN)    ffff8300e2e0fe78 ffffff000475d848 ffff8300e2e06080
(XEN)    ffff8300e2e0fcf8 800000012521c067 0000000132b04067
(XEN)    0000000132b05067 0000000132b06067 0000000000132b06
(XEN)    0000000000132b05 0000000000132b04 ffff8300e2e0fd18
(XEN)    0000000000000082 0000000000003000 ffff8300e2e06080
(XEN)    ffff8300e2e0fd28 ffff828c801355b2 ffff8300e2e0fe78
(XEN)    ffff828c801288db ffff828c801c8100 0000005878a902d7
(XEN)    ffff8300e2ed2080 ffff8300e2e06080 0000000000000086
(XEN)    0000000000003000 ffff8300e2e0ff28 ffff8300e2e06080
(XEN)    ffff8300e2ed2080 ffff8300e2e0fdc0 ffff828c801252d7
(XEN)    820000000000efff ffffff000475d848 ffff8140a0502ff0
(XEN) Xen call trace:
(XEN)    [<ffff828c80168822>] shadow_set_l1e+0x32/0x1b0
(XEN)    [<ffff828c8016af61>] sh_page_fault__shadow_4_guest_4+0xb61/0x10b0
(XEN)    [<ffff828c8013b1c2>] do_page_fault+0x1f2/0x500
(XEN)    [<ffff828c8017a495>] handle_exception_saved+0x2d/0x6b
(XEN)    
(XEN) Pagetable walk from 00000000000000b1:
(XEN)  L4[0x000] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 3:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: 00000000000000b1
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...


* Re: 3.1/2 live migration panic
  2008-01-16 22:37     ` John Levon
@ 2008-01-16 23:01       ` Keir Fraser
  2008-01-17  0:10         ` John Levon
                           ` (3 more replies)
  0 siblings, 4 replies; 26+ messages in thread
From: Keir Fraser @ 2008-01-16 23:01 UTC (permalink / raw)
  To: John Levon; +Cc: xen-devel, Tim Deegan

Well that's a lot saner even without being a debug build. Possibly Tim has
some insight into how this can happen... I expect the 'recursive shadow
fault' is simply a result of the fault in shadow_set_l1e() causing an
unexpected re-entry into shadow code. It'd be interesting to know which
invocation of shadow_set_l1e() is on the backtrace. That might be easier to
work out if you can repro the crash with a debug build of Xen.
Alternatively, since there is obviously a very bogus nearly-NULL pointer
involved, perhaps you could add some tracing to pick up on that? Possibly
the sl1e argument to shadow_set_l1e() is the thing that is bogus here.

 -- Keir



* Re: 3.1/2 live migration panic
  2008-01-16 23:01       ` Keir Fraser
@ 2008-01-17  0:10         ` John Levon
  2008-01-17  0:11         ` John Levon
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 26+ messages in thread
From: John Levon @ 2008-01-17  0:10 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel, Tim Deegan

On Wed, Jan 16, 2008 at 11:01:21PM +0000, Keir Fraser wrote:

> Well that's a lot saner even without being a debug build. Possibly Tim has

I totally missed that I didn't have a debug build. My little script to
build Xen itself was broken, and wasn't setting 'debug=y'. This also
explains the bad Solaris panic stack (the build had no frame pointers, and
our panic code expects them). There was also another problem[1].

I'll reproduce in debug mode, and look some more at stuff around
shadow_set_l1e() as you suggest, and get back to you soon.

cheers
john

[1] because we don't want to get stuck, the Solaris panic path doesn't
take any locks, and that means no console_start_sync(), so we were
leaving the printk serial buffer unprinted.


* Re: 3.1/2 live migration panic
  2008-01-16 23:01       ` Keir Fraser
  2008-01-17  0:10         ` John Levon
@ 2008-01-17  0:11         ` John Levon
  2008-01-17  2:42         ` John Levon
  2008-01-17  9:24         ` Tim Deegan
  3 siblings, 0 replies; 26+ messages in thread
From: John Levon @ 2008-01-17  0:11 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel, Tim Deegan

On Wed, Jan 16, 2008 at 11:01:21PM +0000, Keir Fraser wrote:

> > (XEN) sh error: sh_page_fault__shadow_4_guest_4(): Recursive shadow fault:
> > lock was taken by sh_page_fault__shadow_4_guest_4
> > (XEN) ----[ Xen-3.1.2  x86_64  debug=n  Not tainted ]----

This one might be bogus; I noticed a problem. I'll report back with the
proper panic shortly.

regards
john


* Re: 3.1/2 live migration panic
  2008-01-16 23:01       ` Keir Fraser
  2008-01-17  0:10         ` John Levon
  2008-01-17  0:11         ` John Levon
@ 2008-01-17  2:42         ` John Levon
  2008-01-17  8:10           ` Keir Fraser
  2008-01-17  8:11           ` Keir Fraser
  2008-01-17  9:24         ` Tim Deegan
  3 siblings, 2 replies; 26+ messages in thread
From: John Levon @ 2008-01-17  2:42 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel, Tim Deegan

On Wed, Jan 16, 2008 at 11:01:21PM +0000, Keir Fraser wrote:

> Well that's a lot saner even without being a debug build. Possibly Tim has

Right, I added something to dig out the pending serial console buffer:

> ::serlog
.1.2-xvm  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    3
(XEN) RIP:    e008:[<ffff828c801b4848>] shadow_get_and_create_l1e+0x47/0x32f
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
(XEN) rax: ffff81c0ffc07c48   rbx: ffff8300e2e86100   rcx: 0000000000000000
(XEN) rdx: ffff8300e2ef7dd8   rsi: ffff8300e2ef7ad0   rdi: 00000000001d2cab
(XEN) rbp: ffff8300e2ef7c58   rsp: ffff8300e2ef7bf8   r8:  0000000000000006
(XEN) r9:  0000000000000006   r10: ffff8300e2edc100   r11: ffffff00f683e540
(XEN) r12: ffffff00b96fa858   r13: ffffff00b96f8050   r14: ffffff01f12d1e20
(XEN) r15: ffffff0105d8b980   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 00000001cc4c1000   cr2: ffff81c0ffc07c48
(XEN) ds: 004b   es: 004b   fs: 0000   gs: 01c3   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff8300e2ef7bf8:
(XEN)    ffff828404e45778 0000000000000000 00000000001f4efc
(XEN)    0000000000000050 0000000000000002 ffff81c0ffc07c48
(XEN)    00000000001d2cab ffff8300e2e87280 00000006e2e20100
(XEN)    ffff8300e2ef7da8 ffff8300e2ef7dd8 ffff8300e2edc100
(XEN)    ffff8300e2ef7e58 ffff828c801b68e8 0000000000000000
(XEN)    ffff8300e2e2ac38 ffff8301bc91ffc0 00000000001bc91f
(XEN)    0000000000071916 ffff8300e2ef7d98 0000000000000006
(XEN)    0000000100000006 ffff8300e2ef7d18 ffff828c8014e1fa
(XEN)    0000000498000004 ffff830000000008 000000040076ddc8
(XEN)    00000006a0000005 0000000100000000 0000000100000008
(XEN)    0000000098000004 ffff8300e2ef7f28 ffff8300e2ef7d38
(XEN)    ffff8300e2ef7f28 ffff8300e2ef7d48 0000000000000086
(XEN)    ffff828c80296838 000000038023d2a0 ffff8300e2ef7d48
(XEN)    ffff828c8015c5ef ffff828c80296838 ffff8300e2ef7f28
(XEN)    ffff8300e2ef7d68 ffff828c8015c5b9 00000002e2ef7d78
(XEN)    ffff828c8027c780 00007cff1d108267 ffff828c801464e1
(XEN)    ffffff00b5428008 0000000000000000 0000000000000008
(XEN)    ffff817f80f89690 ffff8300e2ef7e60 000000088027c780
(XEN)    0000000000000000 ffff8300e2ef7e58 0000000000094720
(XEN)    ffff828c8014e5f2 0000000000094720 0000002700000002
(XEN) Xen call trace:
(XEN)    [<ffff828c801b4848>] shadow_get_and_create_l1e+0x47/0x32f
(XEN)    [<ffff828c801b68e8>] sh_page_fault__shadow_4_guest_4+0x598/0xb9e
(XEN)    [<ffff828c80162fff>] paging_fault+0x3c/0x3e
(XEN)    [<ffff828c80162fa9>] fixup_page_fault+0x22b/0x245
(XEN)    [<ffff828c80163041>] do_page_fault+0x40/0x15c
(XEN)    
(XEN) Pagetable walk from ffff81c0ffc07c48:
(XEN)  L4[0x103] = 00000001cc4c1063 0000000000015d86
(XEN)  L3[0x103] = 00000001cc4c1063 0000000000015d86
(XEN)  L2[0x1fe] = 00000001a8d36067 0000000000015b61 
(XEN)  L1[0x007] = 0000000000000000 ffffffffffffffff

> shadow_get_and_create_l1e+0x47::dis
xpv`shadow_get_and_create_l1e+0x1a:     leaq   -0x30(%rbp),%rdx
xpv`shadow_get_and_create_l1e+0x1e:     movq   -0x10(%rbp),%rsi
xpv`shadow_get_and_create_l1e+0x22:     movq   -0x8(%rbp),%rdi
xpv`shadow_get_and_create_l1e+0x26:     call   -0x107   <xpv`shadow_get_and_create_l2e>
xpv`shadow_get_and_create_l1e+0x2b:     movq   %rax,-0x38(%rbp)
xpv`shadow_get_and_create_l1e+0x2f:     cmpq   $0x0,-0x38(%rbp)
xpv`shadow_get_and_create_l1e+0x34:     jne    +0xd     <xpv`shadow_get_and_create_l1e+0x43>
xpv`shadow_get_and_create_l1e+0x36:     movq   $0x0,-0x58(%rbp)
xpv`shadow_get_and_create_l1e+0x3e:     jmp    +0x2df   <xpv`shadow_get_and_create_l1e+0x322>
xpv`shadow_get_and_create_l1e+0x43:     movq   -0x38(%rbp),%rax
xpv`shadow_get_and_create_l1e+0x47:     movq   (%rax),%rdi
xpv`shadow_get_and_create_l1e+0x4a:     call   -0x8f3   <xpv`shadow_l2e_get_flags>
xpv`shadow_get_and_create_l1e+0x4f:     andl   $0x1,%eax
xpv`shadow_get_and_create_l1e+0x52:     testl  %eax,%eax
xpv`shadow_get_and_create_l1e+0x54:     je     +0x96    <xpv`shadow_get_and_create_l1e+0xf0>
xpv`shadow_get_and_create_l1e+0x5a:     cmpl   $0x6,-0x1c(%rbp)
xpv`shadow_get_and_create_l1e+0x5e:     jne    +0x3d    <xpv`shadow_get_and_create_l1e+0x9d>
xpv`shadow_get_and_create_l1e+0x60:     movq   -0x10(%rbp),%rax
xpv`shadow_get_and_create_l1e+0x64:     movq   0x8(%rax),%rax
xpv`shadow_get_and_create_l1e+0x68:     movl   (%rax),%edi
xpv`shadow_get_and_create_l1e+0x6a:     call   -0x1f1b  <xpv`guest_l2e_get_flags>

(I'm back on 3.1 bits here)

1894     sl2e = shadow_get_and_create_l2e(v, gw, &sl2mfn, ft);
1895     if ( sl2e == NULL ) return NULL;
1896     /* Install the sl1 in the l2e if it wasn't there or if we need to
1897      * re-do it to fix a PSE dirty bit. */
1898     if ( shadow_l2e_get_flags(*sl2e) & _PAGE_PRESENT

So sl2e is non-zero, but bogus:

> ffff81c0ffc07c48::dump
                    0 1 2 3  4 5 6 7 \/ 9 a b  c d e f  01234567v9abcdef
mdb: failed to read data at 0xffff81c0ffc07c48: no mapping for address

This pointer is a constant though (right?)

regards
john


* Re: 3.1/2 live migration panic
  2008-01-17  2:42         ` John Levon
@ 2008-01-17  8:10           ` Keir Fraser
  2008-01-17 10:53             ` Tim Deegan
  2008-01-17  8:11           ` Keir Fraser
  1 sibling, 1 reply; 26+ messages in thread
From: Keir Fraser @ 2008-01-17  8:10 UTC (permalink / raw)
  To: John Levon; +Cc: xen-devel, Tim Deegan


On 17/1/08 02:42, "John Levon" <levon@movementarian.org> wrote:

> On Wed, Jan 16, 2008 at 11:01:21PM +0000, Keir Fraser wrote:
> 
> (I'm back on 3.1 bits here)
> 
> 1894     sl2e = shadow_get_and_create_l2e(v, gw, &sl2mfn, ft);
> 1895     if ( sl2e == NULL ) return NULL;
> 1896     /* Install the sl1 in the l2e if it wasn't there or if we need to
> 1897      * re-do it to fix a PSE dirty bit. */
> 1898     if ( shadow_l2e_get_flags(*sl2e) & _PAGE_PRESENT
> 
> So sl2e is non-zero, but bogus:
> 
>> ffff81c0ffc07c48::dump
>                     0 1 2 3  4 5 6 7 \/ 9 a b  c d e f  01234567v9abcdef
> mdb: failed to read data at 0xffff81c0ffc07c48: no mapping for address
> 
> This pointer is a constant though (right?)

What do you mean by 'a constant'? It's a pointer into the guest linear
pagetable, which I suppose is what we expect, and for some reason there is
no PTE at that location to be read. Clearly a higher-level page directory is
missing. Possibly shadow code has got confused and thought a page directory
was present when it wasn't, or perhaps the page directory went away (and/or
was in the process of disappearing from TLBs) as the shadow fault handler
went about its business. I'm sure Tim will have some insights. :-)
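
As a rough illustration of where the walk stops, the sketch below splits a 64-bit linear address into its four pagetable indices, assuming ordinary 4-level x86-64 paging with 4 KiB pages. The struct and function names are made up for this example and are not taken from Xen. Applied to ffff81c0ffc07c48 it yields the same L4/L3/L2/L1 indices that appear in the pagetable walk in the debug-build log.

```c
#include <assert.h>
#include <stdint.h>

/* Indices into the four levels of an x86-64 pagetable (4 KiB pages).
 * Hypothetical helper type, for illustration only. */
struct pt_indices { unsigned l4, l3, l2, l1, offset; };

static struct pt_indices decode_linear(uint64_t va)
{
    /* Canonical addresses sign-extend bit 47 into bits 63..48, so the
     * top 17 bits are either all zero or all one. */
    uint64_t top = va >> 47;
    assert(top == 0 || top == 0x1ffff);

    struct pt_indices ix;
    ix.l4 = (va >> 39) & 0x1ff;   /* bits 47..39 */
    ix.l3 = (va >> 30) & 0x1ff;   /* bits 38..30 */
    ix.l2 = (va >> 21) & 0x1ff;   /* bits 29..21 */
    ix.l1 = (va >> 12) & 0x1ff;   /* bits 20..12 */
    ix.offset = va & 0xfff;       /* byte offset within the page */
    return ix;
}
```

For the faulting address this gives l4 = 0x103, l3 = 0x103, l2 = 0x1fe and l1 = 0x007, and the log shows the L1 entry at that last index as empty.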

 -- Keir


* Re: 3.1/2 live migration panic
  2008-01-17  2:42         ` John Levon
  2008-01-17  8:10           ` Keir Fraser
@ 2008-01-17  8:11           ` Keir Fraser
  2008-01-17 16:28             ` John Levon
  1 sibling, 1 reply; 26+ messages in thread
From: Keir Fraser @ 2008-01-17  8:11 UTC (permalink / raw)
  To: John Levon; +Cc: xen-devel, Tim Deegan

On 17/1/08 02:42, "John Levon" <levon@movementarian.org> wrote:

> Right, I added something to dig out the pending serial console buffer:
> 
>> ::serlog
> .1.2-xvm  x86_64  debug=y  Not tainted ]----
> (XEN) CPU:    3
> (XEN) RIP:    e008:[<ffff828c801b4848>] shadow_get_and_create_l1e+0x47/0x32f

Oh, also this debug build backtrace is very different from the non-debug
one. Do you think the non-debug backtrace was totally bogus, or are we
looking at a common-mode fault that can have both symptoms (i.e., almost-NULL
pointer in shadow_set_l1e(); *and* bogus linear pagetable pointer in
shadow_get_and_create_l1e())?

 -- Keir


* Re: 3.1/2 live migration panic
  2008-01-16 23:01       ` Keir Fraser
                           ` (2 preceding siblings ...)
  2008-01-17  2:42         ` John Levon
@ 2008-01-17  9:24         ` Tim Deegan
  3 siblings, 0 replies; 26+ messages in thread
From: Tim Deegan @ 2008-01-17  9:24 UTC (permalink / raw)
  To: xen-devel; +Cc: John Levon

At 23:01 +0000 on 16 Jan (1200524481), Keir Fraser wrote:
> Well that's a lot saner even without being a debug build. Possibly Tim has
> some insight into how this can happen... I expect the 'recursive shadow
> fault' is simply a result of the fault in shadow_set_l1e() causing an
> unexpected re-entry into shadow code.

Yep, that's exactly it.  It's there to stop an unexpected #PF in the
shadow code itself from being even more confusing by having the shadow
code try to handle it and then fail in some other, weirder, way later.
Instead, we bail out and let the normal fatal-page-fault handler take
over.
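
A minimal sketch of that kind of guard, assuming a lock that records the name of whoever holds it (the "lock was taken by ..." message suggests something along those lines). All names here are hypothetical and this is a single-threaded toy, not the Xen implementation: a handler that finds the lock already held refuses to handle the fault and reports it as fatal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shadow lock: remembers who holds it
 * (NULL when the lock is free). */
struct demo_lock { const char *locker; };

enum fault_result { FAULT_FIXED, FAULT_FATAL };

/* Toy fault handler.  nested > 0 simulates the handler itself taking
 * another page fault while it still holds the lock. */
static enum fault_result demo_fault(struct demo_lock *l, int nested)
{
    assert(l != NULL);

    if (l->locker != NULL)
        return FAULT_FATAL;              /* re-entered: bail out, do not try to fix */

    l->locker = __func__;                /* take the lock, recording the holder */
    enum fault_result r = FAULT_FIXED;
    if (nested > 0)
        r = demo_fault(l, nested - 1);   /* the unexpected fault inside the handler */
    l->locker = NULL;                    /* release the lock */
    return r;
}
```

In the toy the outer call simply propagates the failure; in the hypervisor the fatal-page-fault path panics instead of returning.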

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]


* Re: 3.1/2 live migration panic
  2008-01-17  8:10           ` Keir Fraser
@ 2008-01-17 10:53             ` Tim Deegan
  2008-01-17 22:25               ` John Levon
  0 siblings, 1 reply; 26+ messages in thread
From: Tim Deegan @ 2008-01-17 10:53 UTC (permalink / raw)
  To: John Levon; +Cc: xen-devel


At 08:10 +0000 on 17 Jan (1200557415), Keir Fraser wrote:
> What do you mean by 'a constant'? It's a pointer into the guest linear
> pagetable, which I suppose is what we expect, and for some reason there is
> no PTE at that location to be read. Clearly a higher-level page directory is
> missing. Possibly shadow code has got confused and thought a page directory
> was present when it wasn't, or perhaps the page directory went away (and/or
> was in the process of disappearing from TLBs) as the shadow fault handler
> went about its business. I'm sure Tim will have some insights. :-)

Hmm.  Yes, it's a pointer into the (shadow) linear PT, and we've just
checked that it's valid or made it so.  Code inspection has led to a
lot of dead ends so far; can you try the attached patch?
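
For anyone following along, here is a sketch of how a linear-pagetable pointer relates to the virtual address it describes. It assumes that L4 slot 0x103 is the self-referencing slot behind the shadow linear PT, which is inferred from the walk in the log (L4[0x103] and L3[0x103] hold the same entry, and it matches cr3). The helper names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: L4 slot 0x103 maps the top-level table onto itself. */
#define LINEAR_SLOT 0x103ULL

/* Sign-extend a 48-bit virtual address to canonical 64-bit form. */
static uint64_t canonical(uint64_t va48)
{
    va48 &= 0xffffffffffffULL;
    return (va48 & (1ULL << 47)) ? (va48 | 0xffff000000000000ULL) : va48;
}

/* Start of the window in which l2 entries are visible: going through
 * the self-map slot twice makes the walk end on l2 pages. */
static uint64_t linear_l2_base(void)
{
    return canonical((LINEAR_SLOT << 39) | (LINEAR_SLOT << 30));
}

/* Address at which the l2 entry covering va appears. */
static uint64_t l2e_pointer_for(uint64_t va)
{
    uint64_t idx = (va >> 21) & ((1ULL << 27) - 1);  /* l4, l3, l2 indices */
    return linear_l2_base() + idx * 8;               /* 8 bytes per entry */
}

/* Inverse: the 2 MiB-aligned address whose l2 entry lives at ptr. */
static uint64_t va_for_l2e_pointer(uint64_t ptr)
{
    assert(ptr >= linear_l2_base() && (ptr & 7) == 0);
    uint64_t idx = (ptr - linear_l2_base()) / 8;
    assert(idx < (1ULL << 27));
    return canonical(idx << 21);
}
```

Under that assumption, the faulting pointer ffff81c0ffc07c48 corresponds to the 2 MiB region starting at ffffff01f1200000. The r14 value in the debug-build dump (ffffff01f12d1e20) falls inside that region, which may or may not be a coincidence.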

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 1282 bytes --]

diff -r 0918b4bbffbb xen/arch/x86/mm/shadow/multi.c
--- a/xen/arch/x86/mm/shadow/multi.c	Tue Jan 15 11:19:14 2008 +0000
+++ b/xen/arch/x86/mm/shadow/multi.c	Thu Jan 17 10:28:48 2008 +0000
@@ -1888,11 +1888,28 @@ static shadow_l1e_t * shadow_get_and_cre
                                                 fetch_type_t ft)
 {
     mfn_t sl2mfn;
-    shadow_l2e_t *sl2e;
+    shadow_l2e_t *sl2e, tmp;
 
     /* Get the l2e */
     sl2e = shadow_get_and_create_l2e(v, gw, &sl2mfn, ft);
     if ( sl2e == NULL ) return NULL;
+
+    if ( __copy_from_user(&tmp, sl2e, sizeof(tmp)) != 0 )
+    {
+        local_flush_tlb();
+        if ( __copy_from_user(&tmp, sl2e, sizeof(tmp)) != 0 )
+            SHADOW_ERROR("Can't see the l2e, even with TLB flush");
+        else
+            SHADOW_ERROR("TLB flush made the l2e readable!");
+        show_page_walk((unsigned long) sl2e);
+        print_gw(gw);
+        show_page_walk(gw->va);
+        printk("v->arch.shadow_table[0] == %#lx\n", 
+               pagetable_get_pfn(v->arch.shadow_table[0]));
+        printk("CR3 = %#lx\n", read_cr3());
+        WARN();
+    }
+
     /* Install the sl1 in the l2e if it wasn't there or if we need to
      * re-do it to fix a PSE dirty bit. */
     if ( shadow_l2e_get_flags(*sl2e) & _PAGE_PRESENT 
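
The check in the patch follows a probe, flush, re-probe pattern. The sketch below reproduces only that control flow outside the hypervisor: the two callbacks are hypothetical stand-ins for __copy_from_user() and local_flush_tlb(), and the toy "MMU" exists purely so the logic can be exercised.

```c
#include <assert.h>
#include <stddef.h>

/* Outcome of the two-step probe. */
enum probe_result {
    PROBE_OK,              /* first read succeeded */
    PROBE_OK_AFTER_FLUSH,  /* read only succeeded once the flush was done */
    PROBE_UNMAPPED         /* still unreadable after the flush */
};

/* Hypothetical callbacks; try_read returns 0 on success. */
typedef int (*try_read_fn)(void *ctx);
typedef void (*flush_fn)(void *ctx);

static enum probe_result probe_with_flush(try_read_fn try_read,
                                          flush_fn flush, void *ctx)
{
    assert(try_read != NULL && flush != NULL);
    if (try_read(ctx) == 0)
        return PROBE_OK;
    flush(ctx);
    return try_read(ctx) == 0 ? PROBE_OK_AFTER_FLUSH : PROBE_UNMAPPED;
}

/* Toy model: a read succeeds only if the mapping exists and no stale
 * translation is cached; a flush drops the stale translation. */
struct toy_mmu { int mapped; int stale; };

static int toy_read(void *ctx)
{
    struct toy_mmu *m = ctx;
    return (m->mapped && !m->stale) ? 0 : -1;
}

static void toy_flush(void *ctx)
{
    struct toy_mmu *m = ctx;
    m->stale = 0;
}
```

The two failure outcomes map onto the two SHADOW_ERROR messages in the patch: one says stale translation state was involved, the other says the mapping is really absent.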



* Re: 3.1/2 live migration panic
  2008-01-17  8:11           ` Keir Fraser
@ 2008-01-17 16:28             ` John Levon
  0 siblings, 0 replies; 26+ messages in thread
From: John Levon @ 2008-01-17 16:28 UTC (permalink / raw)
  To: Keir Fraser; +Cc: xen-devel, Tim Deegan

On Thu, Jan 17, 2008 at 08:11:56AM +0000, Keir Fraser wrote:

> Oh, also this debug build backtrace is very different from the non-debug
> one. Do you think the non-debug backtrace was totally bogus, or are we

It was totally bogus, I think (my apologies).

regards
john

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 3.1/2 live migration panic
  2008-01-17 10:53             ` Tim Deegan
@ 2008-01-17 22:25               ` John Levon
  2008-01-18  9:41                 ` Tim Deegan
  0 siblings, 1 reply; 26+ messages in thread
From: John Levon @ 2008-01-17 22:25 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel

On Thu, Jan 17, 2008 at 10:53:12AM +0000, Tim Deegan wrote:

> Hmm.  Yes, it's a pointer into the (shadow) linear PT, and we've just
> checked that it's valid or made it so.  Code inspection has led to a
> lot of dead ends so far; can you try the attached patch?

I haven't reproduced the same panic yet, but I did get the one below instead.
I'm still trying to get it to go down the path where you added the debugging.

This one looks pretty similar though.

regards
john

(XEN) ----[ Xen-3.1.2-xvm  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    0
(XEN) RIP:    e008:[<ffff828c801b26ba>] shadow_set_l1e+0x4f/0x14c
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
(XEN) rax: ffff81ff80d64ff0   rbx: ffff8300e2ed8100   rcx: 000000000015c505
(XEN) rdx: 00000001c96ef065   rsi: ffff81ff80d64ff0   rdi: ffff8300e2e56100
(XEN) rbp: ffff828c80267c58   rsp: ffff828c80267c08   r8:  0000000000000002
(XEN) r9:  0000000000000002   r10: ffff8300e2e56100   r11: ffffff015cdaa808
(XEN) r12: ffffff01ac9fe3c0   r13: ffffff02bdbb1648   r14: ffffff02bdbb1540
(XEN) r15: ffffff014e4db008   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 00000001cb03f000   cr2: ffff81ff80d64ff0
(XEN) ds: 004b   es: 004b   fs: 0000   gs: 01c3   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff828c80267c08:
(XEN)    00000000001686df ffff828c80267c18 ffff8301686dfff0
(XEN)    ffff8300e2e56100 ffff8300e2ed8100 0000000080267d98
(XEN)    000000000015c505 00000001c96ef065 ffff81ff80d64ff0
(XEN)    ffff8300e2e56100 ffff828c80267e58 ffff828c801b5efe
(XEN)    0000000800000000 0000000000000004 ffff8301686dfff0
(XEN)    00000000001686df 00000000001c96ef ffff828c80267d98
(XEN)    0000000000000002 0000000100000002 ffff828c80144306
(XEN)    ffff828c8023d2b8 0000027b0000027a ffff828c80267cf0
(XEN)    0000000000000082 00000002e2e56248 0000000100000000
(XEN)    ffff8300e2e02248 00000000cb03f000 ffff828c80267d10
(XEN)    ffff828c80141ad3 ffff8300e2e02248 00000000e2e56100
(XEN)    ffff828c80267db0 ffff828c80141a95 820000060000efff
(XEN)    000000000000ffff ffff828c80267d70 ffff828c8023d2a0
(XEN)    ffff8300e2e56488 00000000000000a8 ffff828c80267f28
(XEN)    0000000000000000 0000002000000000 0000002000000020
(XEN)    ffff828c80267e40 0000000180142209 0000000000000000
(XEN)    0000000000000008 ffff81ff80d64ff0 00000001c96ef065
(XEN)    000000088013c486 000000000015c505 ffff828c80267e58
(XEN)    00000000001c96ef ffff828c8014e5e2 00000000001c96ef
(XEN)    0000002780267e00 ffffff01ac9fe3c8 ffff8140a0502ff0
(XEN) Xen call trace:
(XEN)    [<ffff828c801b26ba>] shadow_set_l1e+0x4f/0x14c
(XEN)    [<ffff828c801b5efe>] sh_page_fault__shadow_4_guest_4+0x6fe/0xb9e
(XEN)    [<ffff828c8016234f>] paging_fault+0x3c/0x3e
(XEN)    [<ffff828c801622f9>] fixup_page_fault+0x22b/0x245
(XEN)    [<ffff828c80162391>] do_page_fault+0x40/0x15c
(XEN)    
(XEN) Pagetable walk from ffff81ff80d64ff0:
(XEN)  L4[0x103] = 00000001cb03f063 000000000000063b
(XEN)  L3[0x1fe] = 00000001d358e067 00000000000005d4
(XEN)  L2[0x006] = 0000000000000000 ffffffffffffffff 
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 0:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff81ff80d64ff0
(XEN) ****************************************

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 3.1/2 live migration panic
  2008-01-17 22:25               ` John Levon
@ 2008-01-18  9:41                 ` Tim Deegan
  2008-01-18 15:53                   ` John Levon
  2008-01-18 16:53                   ` Tim Deegan
  0 siblings, 2 replies; 26+ messages in thread
From: Tim Deegan @ 2008-01-18  9:41 UTC (permalink / raw)
  To: John Levon; +Cc: xen-devel

At 22:25 +0000 on 17 Jan (1200608703), John Levon wrote:
> (XEN) Xen call trace:
> (XEN)    [<ffff828c801b26ba>] shadow_set_l1e+0x4f/0x14c
> (XEN)    [<ffff828c801b5efe>] sh_page_fault__shadow_4_guest_4+0x6fe/0xb9e
> (XEN)    [<ffff828c8016234f>] paging_fault+0x3c/0x3e
> (XEN)    [<ffff828c801622f9>] fixup_page_fault+0x22b/0x245
> (XEN)    [<ffff828c80162391>] do_page_fault+0x40/0x15c
> (XEN)    
> (XEN) Pagetable walk from ffff81ff80d64ff0:
> (XEN)  L4[0x103] = 00000001cb03f063 000000000000063b
> (XEN)  L3[0x1fe] = 00000001d358e067 00000000000005d4
> (XEN)  L2[0x006] = 0000000000000000 ffffffffffffffff 

Hmmm.  This is the same error, one function further down the chain,
which tells us something interesting.  In this case, the shadow l3e has
been written into the l3 linear map and then used successfully via the
l2 linear map to write the l2e, but is now missing when we come to the
l1 linear map.  Bizarre.

Either something has changed the sl4e or sl3e under our feet (surely not
- we have the shadow lock), or it could still be a missing TLB flush.
If we changed the sl4e (from one present entry to another) but didn't
flush the TLB it could cause this.

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 3.1/2 live migration panic
  2008-01-18  9:41                 ` Tim Deegan
@ 2008-01-18 15:53                   ` John Levon
  2008-01-18 16:53                   ` Tim Deegan
  1 sibling, 0 replies; 26+ messages in thread
From: John Levon @ 2008-01-18 15:53 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel

On Fri, Jan 18, 2008 at 09:41:05AM +0000, Tim Deegan wrote:

> At 22:25 +0000 on 17 Jan (1200608703), John Levon wrote:
> > (XEN) Xen call trace:
> > (XEN)    [<ffff828c801b26ba>] shadow_set_l1e+0x4f/0x14c
> > (XEN)    [<ffff828c801b5efe>] sh_page_fault__shadow_4_guest_4+0x6fe/0xb9e
> > (XEN)    [<ffff828c8016234f>] paging_fault+0x3c/0x3e
> > (XEN)    [<ffff828c801622f9>] fixup_page_fault+0x22b/0x245
> > (XEN)    [<ffff828c80162391>] do_page_fault+0x40/0x15c
> > (XEN)    
> > (XEN) Pagetable walk from ffff81ff80d64ff0:
> > (XEN)  L4[0x103] = 00000001cb03f063 000000000000063b
> > (XEN)  L3[0x1fe] = 00000001d358e067 00000000000005d4
> > (XEN)  L2[0x006] = 0000000000000000 ffffffffffffffff 
> 
> Hmmm.  This is the same error, one function further down the chain,
> which tells us something interesting.  In this case, the shadow l3e has
> been written into the l3 linear map and then used successfully via the
> l2 linear map to write the l2e, but is now missing when we come to the
> l1 linear map.  Bizarre.
> 
> Either something has changed the sl4e or sl3e under our feet (surely not
> - we have the shadow lock), or it could still be a missing TLB flush.
> If we changed the sl4e (from one present entry to another) but didn't
> flush the TLB it could cause this.

Here's another one. This time, I was running with SHADOW_OPTIMIZATIONS
== 0.

(XEN) Xen call trace:
(XEN)    [<ffff828c8018809b>] shadow_set_l2e+0x41/0x40c
(XEN)    [<ffff828c80189520>] shadow_get_and_create_l1e+0x2b3/0x344
(XEN)    [<ffff828c8018b610>] sh_page_fault__shadow_4_guest_4+0x5a7/0xba3
(XEN)    [<ffff828c8014aad0>] fixup_page_fault+0x1e0/0x1f2
(XEN)    [<ffff828c8014ab8a>] do_page_fault+0xa8/0x186
(XEN)    
(XEN) Pagetable walk from ffff81c0ffc09348:
(XEN)  L4[0x103] = 00000001ca552063 00000000000004b9
(XEN)  L3[0x103] = 00000001ca552063 00000000000004b9
(XEN)  L2[0x1fe] = 00000001ca1e2067 00000000000007cd 

regards
john

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 3.1/2 live migration panic
  2008-01-18  9:41                 ` Tim Deegan
  2008-01-18 15:53                   ` John Levon
@ 2008-01-18 16:53                   ` Tim Deegan
  2008-01-20 16:55                     ` John Levon
  1 sibling, 1 reply; 26+ messages in thread
From: Tim Deegan @ 2008-01-18 16:53 UTC (permalink / raw)
  To: John Levon; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 532 bytes --]

At 09:41 +0000 on 18 Jan (1200649265), Tim Deegan wrote:
> Either something has changed the sl4e or sl3e under our feet (surely not
> - we have the shadow lock), or it could still be a missing TLB flush.
> If we changed the sl4e (from one present entry to another) but didn't
> flush the TLB it could cause this.

So: another patch for you; can you see if this makes the crashes go away?

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 1414 bytes --]

diff -r 1e6455d608bd xen/arch/x86/mm/shadow/common.c
--- a/xen/arch/x86/mm/shadow/common.c	Fri Jan 18 16:20:47 2008 +0000
+++ b/xen/arch/x86/mm/shadow/common.c	Fri Jan 18 16:44:00 2008 +0000
@@ -593,11 +593,14 @@ int shadow_write_guest_entry(struct vcpu
  * appropriately.  Returns 0 if we page-faulted, 1 for success. */
 {
     int failed;
-    shadow_lock(v->domain);
+    struct domain *d = v->domain;
+    shadow_lock(d);
     failed = __copy_to_user(p, &new, sizeof(new));
     if ( failed != sizeof(new) )
-        sh_validate_guest_entry(v, gmfn, p, sizeof(new));
-    shadow_unlock(v->domain);
+        if ( sh_validate_guest_entry(v, gmfn, p, sizeof(new)) 
+             & SHADOW_SET_FLUSH )
+            flush_tlb_mask(d->domain_dirty_cpumask);
+    shadow_unlock(d);
     return (failed == 0);
 }
 
@@ -609,13 +612,16 @@ int shadow_cmpxchg_guest_entry(struct vc
  * cmpxchg itself was successful. */
 {
     int failed;
+    struct domain *d = v->domain;
     intpte_t t = *old;
-    shadow_lock(v->domain);
+    shadow_lock(d);
     failed = cmpxchg_user(p, t, new);
     if ( t == *old )
-        sh_validate_guest_entry(v, gmfn, p, sizeof(new));
+        if ( sh_validate_guest_entry(v, gmfn, p, sizeof(new))
+             & SHADOW_SET_FLUSH )
+            flush_tlb_mask(d->domain_dirty_cpumask);
     *old = t;
-    shadow_unlock(v->domain);
+    shadow_unlock(d);
     return (failed == 0);
 }
 

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 3.1/2 live migration panic
  2008-01-18 16:53                   ` Tim Deegan
@ 2008-01-20 16:55                     ` John Levon
  2008-01-22  9:45                       ` Tim Deegan
  0 siblings, 1 reply; 26+ messages in thread
From: John Levon @ 2008-01-20 16:55 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel

On Fri, Jan 18, 2008 at 04:53:24PM +0000, Tim Deegan wrote:

> At 09:41 +0000 on 18 Jan (1200649265), Tim Deegan wrote:
> > Either something has changed the sl4e or sl3e under our feet (surely not
> > - we have the shadow lock), or it could still be a missing TLB flush.
> > If we changed the sl4e (from one present entry to another) but didn't
> > flush the TLB it could cause this.
> 
> So: another patch for you; can you see if this makes the crashes go away?

I'm afraid not:

ffff828c80267a88 xpv:do_page_fault+13d
ffff828c80267ac8 xpv:handle_exception+4b
ffff828c80267af8 0xffff8300e2e44100 (in Xen)
ffff828c80267be8 xpv:shadow_get_and_create_l1e+26b
ffff828c80267c58 xpv:sh_page_fault__shadow_4_guest_4+598
ffff828c80267e58 xpv:paging_fault+3c
ffff828c80267e88 xpv:fixup_page_fault+22b
ffff828c80267ed8 xpv:do_page_fault+40
ffff828c80267f18 xpv:handle_exception+4b
ffff828c80267f48 eb7c54b8

regards
john

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 3.1/2 live migration panic
  2008-01-20 16:55                     ` John Levon
@ 2008-01-22  9:45                       ` Tim Deegan
  2008-01-23 19:15                         ` John Levon
  0 siblings, 1 reply; 26+ messages in thread
From: Tim Deegan @ 2008-01-22  9:45 UTC (permalink / raw)
  To: John Levon; +Cc: xen-devel

[-- Attachment #1: Type: text/plain, Size: 495 bytes --]

At 16:55 +0000 on 20 Jan (1200848136), John Levon wrote:
> On Fri, Jan 18, 2008 at 04:53:24PM +0000, Tim Deegan wrote:
> > So: another patch for you; can you see if this makes the crashes go away?
> 
> I'm afraid not:

Argh.  Well, here's more debugging, since you seem to hit the _l1e case
more often.  This patch includes the previous two as well. 

Cheers,

Tim.

-- 
Tim Deegan <Tim.Deegan@citrix.com>
Principal Software Engineer, Citrix Systems (R&D) Ltd.
[Company #02300071, SL9 0DZ, UK.]

[-- Attachment #2: patch --]
[-- Type: text/plain, Size: 3676 bytes --]

diff -r 81d41461e030 xen/arch/x86/mm/shadow/common.c
--- a/xen/arch/x86/mm/shadow/common.c	Fri Jan 18 13:54:44 2008 +0000
+++ b/xen/arch/x86/mm/shadow/common.c	Tue Jan 22 09:40:05 2008 +0000
@@ -662,11 +662,14 @@ int shadow_write_guest_entry(struct vcpu
  * appropriately.  Returns 0 if we page-faulted, 1 for success. */
 {
     int failed;
-    shadow_lock(v->domain);
+    struct domain *d = v->domain;
+    shadow_lock(d);
     failed = __copy_to_user(p, &new, sizeof(new));
     if ( failed != sizeof(new) )
-        sh_validate_guest_entry(v, gmfn, p, sizeof(new));
-    shadow_unlock(v->domain);
+        if ( sh_validate_guest_entry(v, gmfn, p, sizeof(new)) 
+             & SHADOW_SET_FLUSH )
+            flush_tlb_mask(d->domain_dirty_cpumask);
+    shadow_unlock(d);
     return (failed == 0);
 }
 
@@ -678,13 +681,16 @@ int shadow_cmpxchg_guest_entry(struct vc
  * cmpxchg itself was successful. */
 {
     int failed;
+    struct domain *d = v->domain;
     intpte_t t = *old;
-    shadow_lock(v->domain);
+    shadow_lock(d);
     failed = cmpxchg_user(p, t, new);
     if ( t == *old )
-        sh_validate_guest_entry(v, gmfn, p, sizeof(new));
+        if ( sh_validate_guest_entry(v, gmfn, p, sizeof(new))
+             & SHADOW_SET_FLUSH )
+            flush_tlb_mask(d->domain_dirty_cpumask);
     *old = t;
-    shadow_unlock(v->domain);
+    shadow_unlock(d);
     return (failed == 0);
 }
 
diff -r 81d41461e030 xen/arch/x86/mm/shadow/multi.c
--- a/xen/arch/x86/mm/shadow/multi.c	Fri Jan 18 13:54:44 2008 +0000
+++ b/xen/arch/x86/mm/shadow/multi.c	Tue Jan 22 09:42:20 2008 +0000
@@ -1888,11 +1888,28 @@ static shadow_l1e_t * shadow_get_and_cre
                                                 fetch_type_t ft)
 {
     mfn_t sl2mfn;
-    shadow_l2e_t *sl2e;
+    shadow_l2e_t *sl2e, tmp;
 
     /* Get the l2e */
     sl2e = shadow_get_and_create_l2e(v, gw, &sl2mfn, ft);
     if ( sl2e == NULL ) return NULL;
+
+    if ( __copy_from_user(&tmp, sl2e, sizeof(tmp)) != 0 )
+    {
+        local_flush_tlb();
+        if ( __copy_from_user(&tmp, sl2e, sizeof(tmp)) != 0 )
+            SHADOW_ERROR("Can't see the l2e, even with TLB flush");
+        else
+            SHADOW_ERROR("TLB flush made the l2e readable!");
+        show_page_walk((unsigned long) sl2e);
+        print_gw(gw);
+        show_page_walk(gw->va);
+        printk("v->arch.shadow_table[0] == %#lx\n", 
+               pagetable_get_pfn(v->arch.shadow_table[0]));
+        printk("CR3 = %#lx\n", read_cr3());
+        WARN();
+    }
+
     /* Install the sl1 in the l2e if it wasn't there or if we need to
      * re-do it to fix a PSE dirty bit. */
     if ( shadow_l2e_get_flags(*sl2e) & _PAGE_PRESENT 
@@ -2835,6 +2852,25 @@ static int sh_page_fault(struct vcpu *v,
         return 0;
     }
 
+    { 
+        shadow_l1e_t tmp;
+        if ( __copy_from_user(&tmp, ptr_sl1e, sizeof(tmp)) != 0 )
+        {
+            local_flush_tlb();
+            if ( __copy_from_user(&tmp, ptr_sl1e, sizeof(tmp)) != 0 )
+                SHADOW_ERROR("Can't see the l1e, even with TLB flush");
+            else
+                SHADOW_ERROR("TLB flush made the l1e readable!");
+            show_page_walk((unsigned long) ptr_sl1e);
+            print_gw(&gw);
+            show_page_walk(gw.va);
+            printk("v->arch.shadow_table[0] == %#lx\n", 
+                   pagetable_get_pfn(v->arch.shadow_table[0]));
+            printk("CR3 = %#lx\n", read_cr3());
+            WARN();
+        }
+    }
+
     /* Calculate the shadow entry and write it */
     l1e_propagate_from_guest(v, (gw.l1e) ? gw.l1e : &gw.eff_l1e, gw.l1mfn, 
                              gmfn, &sl1e, ft, mmio);

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: 3.1/2 live migration panic
  2008-01-22  9:45                       ` Tim Deegan
@ 2008-01-23 19:15                         ` John Levon
  2008-02-01 21:19                           ` Dan Magenheimer
  0 siblings, 1 reply; 26+ messages in thread
From: John Levon @ 2008-01-23 19:15 UTC (permalink / raw)
  To: Tim Deegan; +Cc: xen-devel

On Tue, Jan 22, 2008 at 09:45:41AM +0000, Tim Deegan wrote:

> Argh.  Well, here's more debugging, since you seem to hit the _l1e case
> more often.  This patch includes the previous two as well. 

See below. I also saw the "Can't see the l1e" version.

cheers
john

(XEN) sh error: shadow_get_and_create_l1e(): Can't see the l2e, even with TLB flushPagetable walk from ffff81c0ffc06928:
(XEN)  L4[0x103] = 00000001d2f4d063 000000000007dd4e
(XEN)  L3[0x103] = 00000001d2f4d063 000000000007dd4e
(XEN)  L2[0x1fe] = 00000001f73ca067 000000000007dc91 
(XEN)  L1[0x006] = 0000000000000000 ffffffffffffffff
(XEN) Pagetable walk from ffffff01a4a6e8f0:
(XEN)  L4[0x1fe] = 00000001f73ca067 000000000007dc91
(XEN)  L3[0x006] = 0000000000000000 ffffffffffffffff
(XEN) v->arch.shadow_table[0] == 0x1d2f4d
(XEN) CR3 = 0x1d2f4d000
(XEN) Xen WARN at multi.c:1910
(XEN) ----[ Xen-3.1.2-xvm  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff828c801b3ca0>] shadow_get_and_create_l1e+0x147/0x46c
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
(XEN) rax: ffff828c802035e4   rbx: ffff8300e3daa100   rcx: 0000000000000008
(XEN) rdx: ffff828c8027dbf2   rsi: 000000000000000a   rdi: ffff828c802035e4
(XEN) rbp: ffff8300e2e0fc38   rsp: ffff8300e2e0fb98   r8:  00000000ffffffff
(XEN) r9:  00000000ffffffff   r10: ffff828c8027dfdf   r11: ffff828c8027dbe6
(XEN) r12: ffffff01a4a6e8b8   r13: ffffff01a48594c0   r14: ffffff01a4859480
(XEN) r15: ffffff0146e28608   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 00000001d2f4d000   cr2: ffff81c0ffc06928
(XEN) ds: 004b   es: 004b   fs: 0000   gs: 01c3   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff8300e2e0fb98:
(XEN)    00000020e2e0fbc0 ffff8300e2e0fbc0 ffff8300e2e0fbc8
(XEN)    ffff828c8015ba1d ffffffffffffffff 0000000000000000
(XEN)    ffff8300e2e0fc38 ffff8300e2e0fbf8 0000000000000008
(XEN)    ffff81c0ffc06928 0000000000000008 0000000000000008
(XEN)    0000000000000000 ffff81c0ffc06928 000000000015d83e
(XEN)    ffff8300e3dab280 00000006e2eda100 ffff8300e2e0fda8
(XEN)    ffff8300e2e0fdd8 ffff8300e2eca100 ffff8300e2e0fe58
(XEN)    ffff828c801b5da0 000000fc00000000 0000000800000002
(XEN)    0000000000000044 ffff8301c1e7cab8 00000000001c1e7c
(XEN)    00000000001c60c1 ffff8300e2e0fd98 0000000000000006
(XEN)    0000000100000006 000000008015b93f 00000001c60c1065
(XEN)    0000000000000000 ffff8300e2e0fc98 ffff81ff80a5bab8
(XEN)    0000000000000008 0000000000000000 ffff8300e2e0fd20
(XEN)    00000006e2e0fd20 0000000100000000 ffff8300e2e0fd20
(XEN)    ffff8300e2e0fd08 ffff828c8015b5f9 000000208021b300
(XEN)    0000000000000000 0000000000000004 ffff8300e2e0fe90
(XEN)    ffffff000414e4d0 0000000400000020 ffff8300e2e0fe8c
(XEN)    ffffff000414e4cc ffff8300e2e0fd88 ffff828c801668a3
(XEN)    ffff8300e2e0fd68 0000000000000000 0000000000000004
(XEN)    ffff8300e2e0fe8c ffffff000414e4cc 000000008023f4c0
(XEN) Xen call trace:
(XEN)    [<ffff828c801b3ca0>] shadow_get_and_create_l1e+0x147/0x46c
(XEN)    [<ffff828c801b5da0>] sh_page_fault__shadow_4_guest_4+0x598/0xce7
(XEN)    [<ffff828c8016234f>] paging_fault+0x3c/0x3e
(XEN)    [<ffff828c801622f9>] fixup_page_fault+0x22b/0x245
(XEN)    [<ffff828c80162391>] do_page_fault+0x40/0x15c
(XEN)    
(XEN) ----[ Xen-3.1.2-xvm  x86_64  debug=y  Not tainted ]----
(XEN) CPU:    2
(XEN) RIP:    e008:[<ffff828c801b3cb3>] shadow_get_and_create_l1e+0x15a/0x46c
(XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
(XEN) rax: ffff81c0ffc06928   rbx: ffff8300e3daa100   rcx: 0000000000000008
(XEN) rdx: ffff828c8027dbf2   rsi: 000000000000000a   rdi: ffff828c802035e4
(XEN) rbp: ffff8300e2e0fc38   rsp: ffff8300e2e0fb98   r8:  00000000ffffffff
(XEN) r9:  00000000ffffffff   r10: ffff828c8027dfdf   r11: ffff828c8027dbe6
(XEN) r12: ffffff01a4a6e8b8   r13: ffffff01a48594c0   r14: ffffff01a4859480
(XEN) r15: ffffff0146e28608   cr0: 000000008005003b   cr4: 00000000000006f0
(XEN) cr3: 00000001d2f4d000   cr2: ffff81c0ffc06928
(XEN) ds: 004b   es: 004b   fs: 0000   gs: 01c3   ss: 0000   cs: e008
(XEN) Xen stack trace from rsp=ffff8300e2e0fb98:
(XEN)    00000020e2e0fbc0 ffff8300e2e0fbc0 ffff8300e2e0fbc8
(XEN)    ffff828c8015ba1d ffffffffffffffff 0000000000000000
(XEN)    ffff8300e2e0fc38 ffff8300e2e0fbf8 0000000000000008
(XEN)    ffff81c0ffc06928 0000000000000008 0000000000000008
(XEN)    0000000000000000 ffff81c0ffc06928 000000000015d83e
(XEN)    ffff8300e3dab280 00000006e2eda100 ffff8300e2e0fda8
(XEN)    ffff8300e2e0fdd8 ffff8300e2eca100 ffff8300e2e0fe58
(XEN)    ffff828c801b5da0 000000fc00000000 0000000800000002
(XEN)    0000000000000044 ffff8301c1e7cab8 00000000001c1e7c
(XEN)    00000000001c60c1 ffff8300e2e0fd98 0000000000000006
(XEN)    0000000100000006 000000008015b93f 00000001c60c1065
(XEN)    0000000000000000 ffff8300e2e0fc98 ffff81ff80a5bab8
(XEN)    0000000000000008 0000000000000000 ffff8300e2e0fd20
(XEN)    00000006e2e0fd20 0000000100000000 ffff8300e2e0fd20
(XEN)    ffff8300e2e0fd08 ffff828c8015b5f9 000000208021b300
(XEN)    0000000000000000 0000000000000004 ffff8300e2e0fe90
(XEN)    ffffff000414e4d0 0000000400000020 ffff8300e2e0fe8c
(XEN)    ffffff000414e4cc ffff8300e2e0fd88 ffff828c801668a3
(XEN)    ffff8300e2e0fd68 0000000000000000 0000000000000004
(XEN)    ffff8300e2e0fe8c ffffff000414e4cc 000000008023f4c0
(XEN) Xen call trace:
(XEN)    [<ffff828c801b3cb3>] shadow_get_and_create_l1e+0x15a/0x46c
(XEN)    [<ffff828c801b5da0>] sh_page_fault__shadow_4_guest_4+0x598/0xce7
(XEN)    [<ffff828c8016234f>] paging_fault+0x3c/0x3e
(XEN)    [<ffff828c801622f9>] fixup_page_fault+0x22b/0x245
(XEN)    [<ffff828c80162391>] do_page_fault+0x40/0x15c
(XEN)    
(XEN) Pagetable walk from ffff81c0ffc06928:
(XEN)  L4[0x103] = 00000001d2f4d063 000000000007dd4e
(XEN)  L3[0x103] = 00000001d2f4d063 000000000007dd4e
(XEN)  L2[0x1fe] = 00000001f73ca067 000000000007dc91 
(XEN)  L1[0x006] = 0000000000000000 ffffffffffffffff
(XEN) 
(XEN) ****************************************
(XEN) Panic on CPU 2:
(XEN) FATAL PAGE FAULT
(XEN) [error_code=0000]
(XEN) Faulting linear address: ffff81c0ffc06928
(XEN) ****************************************
(XEN) 
(XEN) Reboot in five seconds...

^ permalink raw reply	[flat|nested] 26+ messages in thread

* RE: 3.1/2 live migration panic
  2008-01-23 19:15                         ` John Levon
@ 2008-02-01 21:19                           ` Dan Magenheimer
  2008-02-01 21:24                             ` John Levon
  0 siblings, 1 reply; 26+ messages in thread
From: Dan Magenheimer @ 2008-02-01 21:19 UTC (permalink / raw)
  To: John Levon, Tim Deegan; +Cc: xen-devel@lists.xensource.com

Any progress on this one?  We may be seeing it too (on 3.1.3 near final),
at least the call trace looks very similar to one of the traces that
John previously posted on this thread.

In our case, the problem occurred on an xm create after heavy usage
for >24 hours.  64-bit Xen, 32-bit dom0, AMD x86_64 x 8, if that helps.

Thanks,
Dan

(XEN) Xen call trace:
(XEN)    [<ffff828c8016a02f>] shadow_set_l1e+0x2f/0x1b0
(XEN)    [<ffff828c8016e5d8>] sh_page_fault__shadow_4_guest_4+0x8e8/0xec0
(XEN)    [<ffff828c80169699>] sh_make_shadow+0x479/0x4b0
(XEN)    [<ffff828c8016d459>] sh_update_cr3__shadow_4_guest_4+0x409/0x510
(XEN)    [<ffff828c80166f85>] shadow_update_paging_modes+0x95/0xd0
(XEN)    [<ffff828c8015906f>] svm_cr_access+0xecf/0xf50
(XEN)    [<ffff828c8015509c>] get_effective_addr_modrm64+0x13c/0x3d0
(XEN)    [<ffff828c8014b1d0>] hvm_io_assist+0xe30/0xe60
(XEN)    [<ffff828c80146297>] hvm_do_resume+0x27/0x150
(XEN)    [<ffff828c80151ff6>] vlapic_has_interrupt+0x26/0x60
(XEN)    [<ffff828c801595c8>] svm_vmexit_handler+0x4d8/0x15f0
(XEN)    [<ffff828c80114676>] vcpu_periodic_timer_work+0x16/0x80
(XEN)    [<ffff828c80151f46>] vlapic_get_ppr+0x26/0xb0
(XEN)    [<ffff828c8014b4d4>] is_isa_irq_masked+0x34/0x90
(XEN)    [<ffff828c80151ff6>] vlapic_has_interrupt+0x26/0x60
(XEN)    [<ffff828c8014b5ac>] cpu_has_pending_irq+0x2c/0x60
(XEN)    [<ffff828c8015b08a>] svm_stgi_label+0x8/0xe

(more crash dump data if needed)

> -----Original Message-----
> From: xen-devel-bounces@lists.xensource.com
> [mailto:xen-devel-bounces@lists.xensource.com]On Behalf Of John Levon
> Sent: Wednesday, January 23, 2008 12:16 PM
> To: Tim Deegan
> Cc: xen-devel@lists.xensource.com
> Subject: Re: [Xen-devel] 3.1/2 live migration panic
> 
> 
> On Tue, Jan 22, 2008 at 09:45:41AM +0000, Tim Deegan wrote:
> 
> > Argh.  Well, here's more debugging, since you seem to hit 
> the _l1e case
> > more often.  This patch includes the previous two as well.
> 
> See below. I also saw "Can't see the l1e" version as well.
> 
> cheers
> john
> 
> (XEN) sh error: shadow_get_and_create_l1e(): Can't see the 
> l2e, even with TLB flushPagetable walk from ffff81c0ffc06928:
> (XEN)  L4[0x103] = 00000001d2f4d063 000000000007dd4e
> (XEN)  L3[0x103] = 00000001d2f4d063 000000000007dd4e
> (XEN)  L2[0x1fe] = 00000001f73ca067 000000000007dc91
> (XEN)  L1[0x006] = 0000000000000000 ffffffffffffffff
> (XEN) Pagetable walk from ffffff01a4a6e8f0:
> (XEN)  L4[0x1fe] = 00000001f73ca067 000000000007dc91
> (XEN)  L3[0x006] = 0000000000000000 ffffffffffffffff
> (XEN) v->arch.shadow_table[0] == 0x1d2f4d
> (XEN) CR3 = 0x1d2f4d000
> (XEN) Xen WARN at multi.c:1910
> (XEN) ----[ Xen-3.1.2-xvm  x86_64  debug=y  Not tainted ]----
> (XEN) CPU:    2
> (XEN) RIP:    e008:[<ffff828c801b3ca0>] 
> shadow_get_and_create_l1e+0x147/0x46c
> (XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
> (XEN) rax: ffff828c802035e4   rbx: ffff8300e3daa100   rcx: 
> 0000000000000008
> (XEN) rdx: ffff828c8027dbf2   rsi: 000000000000000a   rdi: 
> ffff828c802035e4
> (XEN) rbp: ffff8300e2e0fc38   rsp: ffff8300e2e0fb98   r8:  
> 00000000ffffffff
> (XEN) r9:  00000000ffffffff   r10: ffff828c8027dfdf   r11: 
> ffff828c8027dbe6
> (XEN) r12: ffffff01a4a6e8b8   r13: ffffff01a48594c0   r14: 
> ffffff01a4859480
> (XEN) r15: ffffff0146e28608   cr0: 000000008005003b   cr4: 
> 00000000000006f0
> (XEN) cr3: 00000001d2f4d000   cr2: ffff81c0ffc06928
> (XEN) ds: 004b   es: 004b   fs: 0000   gs: 01c3   ss: 0000   cs: e008
> (XEN) Xen stack trace from rsp=ffff8300e2e0fb98:
> (XEN)    00000020e2e0fbc0 ffff8300e2e0fbc0 ffff8300e2e0fbc8
> (XEN)    ffff828c8015ba1d ffffffffffffffff 0000000000000000
> (XEN)    ffff8300e2e0fc38 ffff8300e2e0fbf8 0000000000000008
> (XEN)    ffff81c0ffc06928 0000000000000008 0000000000000008
> (XEN)    0000000000000000 ffff81c0ffc06928 000000000015d83e
> (XEN)    ffff8300e3dab280 00000006e2eda100 ffff8300e2e0fda8
> (XEN)    ffff8300e2e0fdd8 ffff8300e2eca100 ffff8300e2e0fe58
> (XEN)    ffff828c801b5da0 000000fc00000000 0000000800000002
> (XEN)    0000000000000044 ffff8301c1e7cab8 00000000001c1e7c
> (XEN)    00000000001c60c1 ffff8300e2e0fd98 0000000000000006
> (XEN)    0000000100000006 000000008015b93f 00000001c60c1065
> (XEN)    0000000000000000 ffff8300e2e0fc98 ffff81ff80a5bab8
> (XEN)    0000000000000008 0000000000000000 ffff8300e2e0fd20
> (XEN)    00000006e2e0fd20 0000000100000000 ffff8300e2e0fd20
> (XEN)    ffff8300e2e0fd08 ffff828c8015b5f9 000000208021b300
> (XEN)    0000000000000000 0000000000000004 ffff8300e2e0fe90
> (XEN)    ffffff000414e4d0 0000000400000020 ffff8300e2e0fe8c
> (XEN)    ffffff000414e4cc ffff8300e2e0fd88 ffff828c801668a3
> (XEN)    ffff8300e2e0fd68 0000000000000000 0000000000000004
> (XEN)    ffff8300e2e0fe8c ffffff000414e4cc 000000008023f4c0
> (XEN) Xen call trace:
> (XEN)    [<ffff828c801b3ca0>] shadow_get_and_create_l1e+0x147/0x46c
> (XEN)    [<ffff828c801b5da0>] 
> sh_page_fault__shadow_4_guest_4+0x598/0xce7
> (XEN)    [<ffff828c8016234f>] paging_fault+0x3c/0x3e
> (XEN)    [<ffff828c801622f9>] fixup_page_fault+0x22b/0x245
> (XEN)    [<ffff828c80162391>] do_page_fault+0x40/0x15c
> (XEN)
> (XEN) ----[ Xen-3.1.2-xvm  x86_64  debug=y  Not tainted ]----
> (XEN) CPU:    2
> (XEN) RIP:    e008:[<ffff828c801b3cb3>] 
> shadow_get_and_create_l1e+0x15a/0x46c
> (XEN) RFLAGS: 0000000000010286   CONTEXT: hypervisor
> (XEN) rax: ffff81c0ffc06928   rbx: ffff8300e3daa100   rcx: 
> 0000000000000008
> (XEN) rdx: ffff828c8027dbf2   rsi: 000000000000000a   rdi: 
> ffff828c802035e4
> (XEN) rbp: ffff8300e2e0fc38   rsp: ffff8300e2e0fb98   r8:  
> 00000000ffffffff
> (XEN) r9:  00000000ffffffff   r10: ffff828c8027dfdf   r11: 
> ffff828c8027dbe6
> (XEN) r12: ffffff01a4a6e8b8   r13: ffffff01a48594c0   r14: 
> ffffff01a4859480
> (XEN) r15: ffffff0146e28608   cr0: 000000008005003b   cr4: 
> 00000000000006f0
> (XEN) cr3: 00000001d2f4d000   cr2: ffff81c0ffc06928
> (XEN) ds: 004b   es: 004b   fs: 0000   gs: 01c3   ss: 0000   cs: e008
> (XEN) Xen stack trace from rsp=ffff8300e2e0fb98:
> (XEN)    00000020e2e0fbc0 ffff8300e2e0fbc0 ffff8300e2e0fbc8
> (XEN)    ffff828c8015ba1d ffffffffffffffff 0000000000000000
> (XEN)    ffff8300e2e0fc38 ffff8300e2e0fbf8 0000000000000008
> (XEN)    ffff81c0ffc06928 0000000000000008 0000000000000008
> (XEN)    0000000000000000 ffff81c0ffc06928 000000000015d83e
> (XEN)    ffff8300e3dab280 00000006e2eda100 ffff8300e2e0fda8
> (XEN)    ffff8300e2e0fdd8 ffff8300e2eca100 ffff8300e2e0fe58
> (XEN)    ffff828c801b5da0 000000fc00000000 0000000800000002
> (XEN)    0000000000000044 ffff8301c1e7cab8 00000000001c1e7c
> (XEN)    00000000001c60c1 ffff8300e2e0fd98 0000000000000006
> (XEN)    0000000100000006 000000008015b93f 00000001c60c1065
> (XEN)    0000000000000000 ffff8300e2e0fc98 ffff81ff80a5bab8
> (XEN)    0000000000000008 0000000000000000 ffff8300e2e0fd20
> (XEN)    00000006e2e0fd20 0000000100000000 ffff8300e2e0fd20
> (XEN)    ffff8300e2e0fd08 ffff828c8015b5f9 000000208021b300
> (XEN)    0000000000000000 0000000000000004 ffff8300e2e0fe90
> (XEN)    ffffff000414e4d0 0000000400000020 ffff8300e2e0fe8c
> (XEN)    ffffff000414e4cc ffff8300e2e0fd88 ffff828c801668a3
> (XEN)    ffff8300e2e0fd68 0000000000000000 0000000000000004
> (XEN)    ffff8300e2e0fe8c ffffff000414e4cc 000000008023f4c0
> (XEN) Xen call trace:
> (XEN)    [<ffff828c801b3cb3>] shadow_get_and_create_l1e+0x15a/0x46c
> (XEN)    [<ffff828c801b5da0>] 
> sh_page_fault__shadow_4_guest_4+0x598/0xce7
> (XEN)    [<ffff828c8016234f>] paging_fault+0x3c/0x3e
> (XEN)    [<ffff828c801622f9>] fixup_page_fault+0x22b/0x245
> (XEN)    [<ffff828c80162391>] do_page_fault+0x40/0x15c
> (XEN)
> (XEN) Pagetable walk from ffff81c0ffc06928:
> (XEN)  L4[0x103] = 00000001d2f4d063 000000000007dd4e
> (XEN)  L3[0x103] = 00000001d2f4d063 000000000007dd4e
> (XEN)  L2[0x1fe] = 00000001f73ca067 000000000007dc91
> (XEN)  L1[0x006] = 0000000000000000 ffffffffffffffff
> (XEN)
> (XEN) ****************************************
> (XEN) Panic on CPU 2:
> (XEN) FATAL PAGE FAULT
> (XEN) [error_code=0000]
> (XEN) Faulting linear address: ffff81c0ffc06928
> (XEN) ****************************************
> (XEN)
> (XEN) Reboot in five seconds...
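For anyone reading the pagetable walk in the log above, here is a minimal sketch of how the faulting linear address maps onto the L4/L3/L2/L1 indices that Xen prints. It assumes standard x86-64 4-level paging with 4 KiB pages and the architectural page-fault error code bits (bit 0 = present, bit 1 = write, bit 2 = user). The function names are hypothetical helpers for illustration, not anything from the Xen tree.

```python
def pagetable_indices(vaddr):
    """Split an x86-64 linear address (4-level paging, 4 KiB pages)
    into its per-level table indices and page offset."""
    return {
        "L4": (vaddr >> 39) & 0x1FF,  # bits 47:39
        "L3": (vaddr >> 30) & 0x1FF,  # bits 38:30
        "L2": (vaddr >> 21) & 0x1FF,  # bits 29:21
        "L1": (vaddr >> 12) & 0x1FF,  # bits 20:12
        "offset": vaddr & 0xFFF,      # bits 11:0
    }


def decode_pf_error_code(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "present": bool(code & 0x1),  # 0 = fault on a not-present page
        "write": bool(code & 0x2),    # 0 = read access
        "user": bool(code & 0x4),     # 0 = supervisor-mode access
    }


if __name__ == "__main__":
    # Faulting linear address and error code taken from the log above.
    idx = pagetable_indices(0xFFFF81C0FFC06928)
    for level in ("L4", "L3", "L2", "L1"):
        print(f"{level}[0x{idx[level]:03x}]")
    print(decode_pf_error_code(0x0000))
```

The indices come out as 0x103, 0x103, 0x1fe and 0x006, matching the walk in the log, and error_code=0000 decodes as a supervisor-mode read of a not-present page, which is consistent with the all-zero L1[0x006] entry.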


* Re: 3.1/2 live migration panic
  2008-02-01 21:19                           ` Dan Magenheimer
@ 2008-02-01 21:24                             ` John Levon
  0 siblings, 0 replies; 26+ messages in thread
From: John Levon @ 2008-02-01 21:24 UTC (permalink / raw)
  To: Dan Magenheimer; +Cc: xen-devel@lists.xensource.com, Tim Deegan

On Fri, Feb 01, 2008 at 02:19:40PM -0700, Dan Magenheimer wrote:

> Any progress on this one?  We may be seeing it too (on 3.1.3 near final),
> at least the call trace looks very similar to one of the traces that
> John previously posted on this thread.

It does look pretty similar.

> In our case, the problem occurred on an xm create after heavy usage
> for >24 hours.  64-bit Xen, 32-bit dom0, AMD x86_64 x 8, if that helps.

More details on that AMD box?

It transpires that I can reproduce it on only one machine, a
4-way AMD Sun Fire V40Z. I'm currently investigating whether it needs
a BIOS update.

I've tested on a number of other Intel and AMD boxes and can't reproduce
the problem.

regards
john


Thread overview: 26+ messages
2008-01-16  1:18 3.1/2 live migration panic John Levon
2008-01-16  1:47 ` Ian Pratt
2008-01-16  1:51   ` John Levon
2008-01-16 19:45   ` John Levon
2008-01-16 20:37 ` John Levon
2008-01-16 21:43   ` Keir Fraser
2008-01-16 21:52     ` John Levon
2008-01-16 22:37     ` John Levon
2008-01-16 23:01       ` Keir Fraser
2008-01-17  0:10         ` John Levon
2008-01-17  0:11         ` John Levon
2008-01-17  2:42         ` John Levon
2008-01-17  8:10           ` Keir Fraser
2008-01-17 10:53             ` Tim Deegan
2008-01-17 22:25               ` John Levon
2008-01-18  9:41                 ` Tim Deegan
2008-01-18 15:53                   ` John Levon
2008-01-18 16:53                   ` Tim Deegan
2008-01-20 16:55                     ` John Levon
2008-01-22  9:45                       ` Tim Deegan
2008-01-23 19:15                         ` John Levon
2008-02-01 21:19                           ` Dan Magenheimer
2008-02-01 21:24                             ` John Levon
2008-01-17  8:11           ` Keir Fraser
2008-01-17 16:28             ` John Levon
2008-01-17  9:24         ` Tim Deegan
