dom0 pvops crash apparently due to guest migration

* dom0 pvops crash apparently due to guest migration
From: Ian Jackson @ 2010-11-29 11:59 UTC
  To: Jeremy Fitzhardinge; +Cc: xen-devel

One of my test boxes encountered the crash whose oops you see below.
It doesn't do it every time, or even every time on this machine (since
the credit2 test in the same run worked).  The crash seems to have
occurred just at the end of the migration of a PV guest.

The setup is 32-bit dom0 and domU on 64-bit Xen.
The pvops kernel version was 56eabf9f2a6632d3b2ef.

The complete logs are here:
  http://www.chiark.greenend.org.uk/~xensrcts/logs/2847/test-amd64-i386-xl-multivcpu/
(The machine has since been reused so those logs are what there is.)

Ian.

------------[ cut here ]------------
kernel BUG at arch/x86/mm/fault.c:210!
invalid opcode: 0000 [#1] SMP 
last sysfs file: /sys/devices/virtual/net/lo/operstate
Modules linked in: e1000e [last unloaded: scsi_wait_scan]

Pid: 22, comm: xenwatch Not tainted (2.6.32.26 #1)         
EIP: 0061:[<c104c058>] EFLAGS: 00010082 CPU: 0
EIP is at vmalloc_sync_one+0x118/0x128
EAX: 003f8360 EBX: 1fc1b067 ECX: ffffffe0 EDX: ab273fff
ESI: 00000000 EDI: c182adf0 EBP: dfcdbe88 ESP: dfcdbe64
 DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
Process xenwatch (pid: 22, ti=dfcda000 task=dfccc510 task.ti=dfcda000)
Stack:
 dbd7b384 00cdbe88 00000000 c568f200 dbd7b384 ab273fff f7c00000 c568f200
<0> dbd7b384 dfcdbea8 c104ca9a c182adf0 c1780204 dbd75f40 dfd45a20 dbd75f40
<0> dfcdbf5c dfcdbeb4 c10df14a dfcdbf1c dfcdbef8 c12313b1 0000001b 00000008
Call Trace:
 [<c104ca9a>] ? vmalloc_sync_all+0x5c/0xbe
 [<c10df14a>] ? alloc_vm_area+0x44/0x4b
 [<c12313b1>] ? blkif_map+0x2d/0x204
 [<c1230cbb>] ? frontend_changed+0x194/0x209
 [<c1229b39>] ? xenbus_otherend_changed+0x5c/0x61
 [<c1229c97>] ? frontend_changed+0xa/0xd
 [<c1228783>] ? xenwatch_thread+0xf6/0x11e
 [<c10795df>] ? autoremove_wake_function+0x0/0x33
 [<c122868d>] ? xenwatch_thread+0x0/0x11e
 [<c1079397>] ? kthread+0x61/0x66
 [<c1079336>] ? kthread+0x0/0x66
 [<c1030dd7>] ? kernel_thread_helper+0x7/0x10
Code: eb fe 89 d8 89 f2 ff 15 08 7d 68 c1 89 d6 8b 55 f0 89 c3 89 c8 0f ac d0 0c 89 c1 89 d8 0f ac f0 0c c1 e1 05 c1 e0 05 39 c1 74 06 <0f> 0b eb fe 31 ff 83 c4 18 89 f8 5b 5e 5f 5d c3 55 89 e5 56 53 
EIP: [<c104c058>] vmalloc_sync_one+0x118/0x128 SS:ESP 0069:dfcdbe64
---[ end trace 7b608ed9c5e5ed4e ]---
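
The "Code:" line above marks the byte at the faulting EIP with <..>. A minimal sketch of reading it, to check that the trapping instruction is the two-byte UD2 opcode (0f 0b), which is consistent with the "invalid opcode" and "kernel BUG" lines. The helper names are hypothetical and only the bytes themselves come from the oops:

```python
import re

# The "Code:" bytes copied from the oops above.
CODE_LINE = (
    "eb fe 89 d8 89 f2 ff 15 08 7d 68 c1 89 d6 8b 55 f0 89 c3 89 c8 "
    "0f ac d0 0c 89 c1 89 d8 0f ac f0 0c c1 e1 05 c1 e0 05 39 c1 74 06 "
    "<0f> 0b eb fe 31 ff 83 c4 18 89 f8 5b 5e 5f 5d c3 55 89 e5 56 53"
)


def to_bytes(tokens):
    """Convert hex tokens such as '0f' or '<0f>' into a bytes object."""
    return bytes(int(tok.strip("<>"), 16) for tok in tokens)


def split_at_fault(code_line):
    """Return (bytes before EIP, bytes starting at EIP) for a Code: line."""
    tokens = code_line.split()
    marked = [i for i, tok in enumerate(tokens)
              if re.fullmatch(r"<[0-9a-f]{2}>", tok)]
    if len(marked) != 1:
        raise ValueError("expected exactly one <..> marker")
    pos = marked[0]
    return to_bytes(tokens[:pos]), to_bytes(tokens[pos:])


before, after = split_at_fault(CODE_LINE)
is_ud2 = after[:2] == b"\x0f\x0b"  # 0f 0b is the x86 UD2 opcode
print("faulting instruction is UD2:", is_ud2)
print("bytes just before it:", before[-4:].hex(" "))
```

The four bytes before the marker (39 c1 74 06) are a register compare followed by a short conditional jump over the UD2, so the trap is reached only when the two compared values differ.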

* Re: dom0 pvops crash apparently due to guest migration
From: Jeremy Fitzhardinge @ 2010-11-29 18:53 UTC
  To: Ian Jackson; +Cc: xen-devel

On 11/29/2010 03:59 AM, Ian Jackson wrote:
> One of my test boxes encountered the crash whose oops you see below.
> It doesn't do it every time, or even every time on this machine (since
> the credit2 test in the same run worked).  The crash seems to have
> occurred just at the end of the migration of a PV guest.

Do you have a feel for what the likelihood of failure is?  Has this
started happening recently?

> The setup is 32-bit dom0 and domU on 64-bit Xen.
> The pvops kernel version was 56eabf9f2a6632d3b2ef.
>
> The complete logs are here:
>   http://www.chiark.greenend.org.uk/~xensrcts/logs/2847/test-amd64-i386-xl-multivcpu/
> (The machine has since been reused so those logs are what there is.)
>
> Ian.
>
> ------------[ cut here ]------------
> kernel BUG at arch/x86/mm/fault.c:210!
> invalid opcode: 0000 [#1] SMP 
> last sysfs file: /sys/devices/virtual/net/lo/operstate
> Modules linked in: e1000e [last unloaded: scsi_wait_scan]
>
> Pid: 22, comm: xenwatch Not tainted (2.6.32.26 #1)         
> EIP: 0061:[<c104c058>] EFLAGS: 00010082 CPU: 0
> EIP is at vmalloc_sync_one+0x118/0x128
> EAX: 003f8360 EBX: 1fc1b067 ECX: ffffffe0 EDX: ab273fff
> ESI: 00000000 EDI: c182adf0 EBP: dfcdbe88 ESP: dfcdbe64
>  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
> Process xenwatch (pid: 22, ti=dfcda000 task=dfccc510 task.ti=dfcda000)
> Stack:
>  dbd7b384 00cdbe88 00000000 c568f200 dbd7b384 ab273fff f7c00000 c568f200
> <0> dbd7b384 dfcdbea8 c104ca9a c182adf0 c1780204 dbd75f40 dfd45a20 dbd75f40
> <0> dfcdbf5c dfcdbeb4 c10df14a dfcdbf1c dfcdbef8 c12313b1 0000001b 00000008
> Call Trace:
>  [<c104ca9a>] ? vmalloc_sync_all+0x5c/0xbe
>  [<c10df14a>] ? alloc_vm_area+0x44/0x4b

Hm, I'm still not really sure why alloc_vm_area() does a
vmalloc_sync_all in the first place...  But that BUG shouldn't happen
regardless.

    J

>  [<c12313b1>] ? blkif_map+0x2d/0x204
>  [<c1230cbb>] ? frontend_changed+0x194/0x209
>  [<c1229b39>] ? xenbus_otherend_changed+0x5c/0x61
>  [<c1229c97>] ? frontend_changed+0xa/0xd
>  [<c1228783>] ? xenwatch_thread+0xf6/0x11e
>  [<c10795df>] ? autoremove_wake_function+0x0/0x33
>  [<c122868d>] ? xenwatch_thread+0x0/0x11e
>  [<c1079397>] ? kthread+0x61/0x66
>  [<c1079336>] ? kthread+0x0/0x66
>  [<c1030dd7>] ? kernel_thread_helper+0x7/0x10
> Code: eb fe 89 d8 89 f2 ff 15 08 7d 68 c1 89 d6 8b 55 f0 89 c3 89 c8 0f ac d0 0c 89 c1 89 d8 0f ac f0 0c c1 e1 05 c1 e0 05 39 c1 74 06 <0f> 0b eb fe 31 ff 83 c4 18 89 f8 5b 5e 5f 5d c3 55 89 e5 56 53 
> EIP: [<c104c058>] vmalloc_sync_one+0x118/0x128 SS:ESP 0069:dfcdbe64
> ---[ end trace 7b608ed9c5e5ed4e ]---
>

* Re: dom0 pvops crash apparently due to guest migration
From: Ian Jackson @ 2010-11-30 11:45 UTC
  To: Jeremy Fitzhardinge; +Cc: xen-devel@lists.xensource.com

Jeremy Fitzhardinge writes ("Re: dom0 pvops crash apparently due to guest migration"):
> On 11/29/2010 03:59 AM, Ian Jackson wrote:
> > One of my test boxes encountered the crash whose oops you see below.
> > It doesn't do it every time, or even every time on this machine (since
> > the credit2 test in the same run worked).  The crash seems to have
> > occurred just at the end of the migration of a PV guest.
> 
> Do you have a feel for what the likelihood of failure is?  Has this
> started happening recently?

The probability of failure seems reasonably high.  This is a different
test machine so it is possible that there is something wrong with the
hardware, but all of the tests with the XCP kernel work fine.

> Hm, I'm still not really sure why alloc_vm_area() does a
> vmalloc_sync_all in the first place...  But that BUG shouldn't happen
> regardless.

It's not just blkback; here's one that shows a call trace with netback
instead:

 ------------[ cut here ]------------
 kernel BUG at arch/x86/mm/fault.c:210!
 invalid opcode: 0000 [#1] SMP 
 last sysfs file: /sys/devices/virtual/net/xenbr0/bridge/topology_change_detected
 Modules linked in: e1000e [last unloaded: scsi_wait_scan]
 
 Pid: 22, comm: xenwatch Not tainted (2.6.32.26 #1)         
 EIP: 0061:[<c104c058>] EFLAGS: 00010086 CPU: 0
 EIP is at vmalloc_sync_one+0x118/0x128
 EAX: 00088480 EBX: 04424067 ECX: ffffffe0 EDX: 7c83ffff
 ESI: 00000000 EDI: c182ae00 EBP: dfcdbeb0 ESP: dfcdbe8c
  DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
 Process xenwatch (pid: 22, ti=dfcda000 task=dfccc510 task.ti=dfcda000)
 Stack:
  dff99c44 00cdbeb0 00000000 c568f800 dff99c44 7c83ffff f8000000 c568f800
 <0> dff99c44 dfcdbed0 c104ca9a c182ae00 c1780204 c445d600 cd034dc0 c445d600
 <0> c445d620 dfcdbedc c10df14a fffffff4 dfcdbf34 c1236b30 00000301 00000300
 Call Trace:
  [<c104ca9a>] ? vmalloc_sync_all+0x5c/0xbe
  [<c10df14a>] ? alloc_vm_area+0x44/0x4b
  [<c1236b30>] ? netif_map+0x2d/0x2e3
  [<c10e95c8>] ? kfree+0x111/0x119
  [<c1229291>] ? xenbus_scanf+0x38/0x4b
  [<c1229291>] ? xenbus_scanf+0x38/0x4b
  [<c12361fa>] ? frontend_changed+0x2c3/0x526
  [<c1229b39>] ? xenbus_otherend_changed+0x5c/0x61
  [<c1229c97>] ? frontend_changed+0xa/0xd
  [<c1228783>] ? xenwatch_thread+0xf6/0x11e
  [<c10795df>] ? autoremove_wake_function+0x0/0x33
  [<c122868d>] ? xenwatch_thread+0x0/0x11e
  [<c1079397>] ? kthread+0x61/0x66
  [<c1079336>] ? kthread+0x0/0x66
  [<c1030dd7>] ? kernel_thread_helper+0x7/0x10
 Code: eb fe 89 d8 89 f2 ff 15 08 7d 68 c1 89 d6 8b 55 f0 89 c3 89 c8 0f ac d0 0c 89 c1 89 d8 0f ac f0 0c c1 e1 05 c1 e0 05 39 c1 74 06 <0f> 0b eb fe 31 ff 83 c4 18 89 f8 5b 5e 5f 5d c3 55 89 e5 56 53 
 EIP: [<c104c058>] vmalloc_sync_one+0x118/0x128 SS:ESP 0069:dfcdbe8c
 ---[ end trace 008e317122f8c510 ]---

Ian.
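
Both traces stop at the same offset in vmalloc_sync_one. As a rough cross-check, assuming the bytes just before the marked opcode (0f ac f0 0c, c1 e0 05, 39 c1) decode as a 64-bit shift right by 12 of ESI:EBX into EAX, a shift left by 5, and a compare against ECX, the EAX value in each oops can be rederived from the other registers. Reading EBX as a page-table entry and the shift by 5 as scaling a frame number is an assumption that the thread does not confirm:

```python
PAGE_SHIFT = 12   # 4 KiB pages on x86
SCALE_SHIFT = 5   # taken from the "c1 e0 05" / "c1 e1 05" bytes in Code:
MASK32 = 0xFFFFFFFF


def scaled_frame(esi, ebx):
    """Model 'shrd $12' on the ESI:EBX pair, then 'shl $5', in 32 bits."""
    entry = (esi << 32) | ebx
    frame = (entry >> PAGE_SHIFT) & MASK32
    return (frame << SCALE_SHIFT) & MASK32


# Register values copied from the two oopses in this thread.
OOPSES = {
    "blkback": {"eax": 0x003F8360, "ebx": 0x1FC1B067,
                "ecx": 0xFFFFFFE0, "esi": 0x00000000},
    "netback": {"eax": 0x00088480, "ebx": 0x04424067,
                "ecx": 0xFFFFFFE0, "esi": 0x00000000},
}

for name, regs in OOPSES.items():
    derived = scaled_frame(regs["esi"], regs["ebx"])
    print(name,
          "derived EAX:", hex(derived),
          "matches oops:", derived == regs["eax"],
          "differs from ECX:", derived != regs["ecx"])
```

In both cases the derived value matches EAX and differs from ECX. ECX is ffffffe0 in both oopses, which is what an all-ones frame number would give after the same shift; that may hint at an invalid entry on the other side of the comparison, though nothing in the thread establishes it.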

* Re: Re: dom0 pvops crash apparently due to guest migration
From: Keir Fraser @ 2010-11-30 12:17 UTC
  To: Jeremy Fitzhardinge, Ian Jackson; +Cc: xen-devel

On 29/11/2010 18:53, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:

> Hm, I'm still not really sure why alloc_vm_area() does a
> vmalloc_sync_all in the first place...  But that BUG shouldn't happen
> regardless.

I think vmalloc_sync_all() is required only if alloc_vm_area()'d regions are
used as hypercall buffers. I'm not sure if they ever are, these days. The
sync wouldn't be needed for allocated regions used as shared rings, for
example. You might be able to do a quick audit of users and then remove the
vmalloc_sync_all(). Presumably if v_s_a is planned to go away it'd be nice
to get rid of this usage for that reason also.

 -- Keir

* Re: Re: dom0 pvops crash apparently due to guest migration
From: Jeremy Fitzhardinge @ 2010-11-30 20:37 UTC
  To: Keir Fraser; +Cc: xen-devel, Ian Jackson

On 11/30/2010 04:17 AM, Keir Fraser wrote:
> On 29/11/2010 18:53, "Jeremy Fitzhardinge" <jeremy@goop.org> wrote:
>
>> Hm, I'm still not really sure why alloc_vm_area() does a
>> vmalloc_sync_all in the first place...  But that BUG shouldn't happen
>> regardless.
> I think vmalloc_sync_all() is required only if alloc_vm_area()'d regions are
> used as hypercall buffers. I'm not sure if they ever are, these days.

Doesn't look like it.  And it would need the buffer to be filled out by
one task and then the hypercall issued by another one, which seems unlikely.

And even if we did issue hypercalls from a vmalloc area, it shouldn't be
alloc_vm_area()'s job to make sure that works, since it's a generic core
function now.

    J
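
The scenario discussed above can be sketched as a toy model. It assumes that each task has its own page directory, that kernel vmalloc mappings are copied into it lazily when the task faults on them, and that the hypervisor cannot trigger that fault handler while reading a hypercall buffer. All class and function names are hypothetical and this is not kernel code:

```python
class Task:
    """A task with its own page directory, filled lazily on access."""

    def __init__(self, kernel):
        self.kernel = kernel
        self.pgd = {}

    def touch(self, slot):
        # Ordinary kernel access: a miss is repaired by copying the entry
        # from the kernel's reference page directory (the lazy sync).
        if slot not in self.pgd:
            self.pgd[slot] = self.kernel.reference_pgd[slot]
        return self.pgd[slot]

    def hypercall_sees(self, slot):
        # Hypervisor access: no guest fault handler runs, so the mapping
        # is visible only if this task's directory already has the entry.
        return slot in self.pgd


class Kernel:
    def __init__(self):
        self.reference_pgd = {}
        self.tasks = []

    def spawn(self):
        task = Task(self)
        self.tasks.append(task)
        return task

    def alloc_area(self, slot, sync_all):
        table = object()  # stands in for a newly allocated page table
        self.reference_pgd[slot] = table
        if sync_all:
            # Eager sync: install the entry everywhere and insist that any
            # entry already present agrees with the reference one.
            for task in self.tasks:
                current = task.pgd.setdefault(slot, table)
                if current is not table:
                    raise RuntimeError("page directories disagree")
        return table


def scenario(sync_all):
    """One task fills the buffer; report which tasks could issue the call."""
    slot = 7  # arbitrary directory slot chosen for illustration
    kernel = Kernel()
    filler, caller = kernel.spawn(), kernel.spawn()
    kernel.alloc_area(slot, sync_all=sync_all)
    filler.touch(slot)
    return filler.hypercall_sees(slot), caller.hypercall_sees(slot)


print(scenario(sync_all=False))  # (True, False)
print(scenario(sync_all=True))   # (True, True)
```

Without the eager sync, only a task that never touched the buffer before issuing the hypercall is affected, which is the fill-in-one-task, call-from-another case described above.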
