[Qemu-devel] Bug in recent postcopy patch

* [Qemu-devel] Bug in recent postcopy patch
@ 2014-10-29 22:27 Gary Hook
  2014-10-30 10:03 ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 6+ messages in thread
From: Gary Hook @ 2014-10-29 22:27 UTC
  To: qemu-devel@nongnu.org

*Knock* *knock* *knock* Is this thing on?

I applied the 47 pieces of the recent postcopy patch to 2.1.2 and am
poking around. An attempt to migrate results in a NULL pointer dereference
in savevm.c.  Here is info from gdb:

Most of qemu_savevm_state_pending() succeeds, until it gets to the end.
Here's the relevant thread while calling is_active():

(gdb) backtrace
#0  block_is_active (opaque=0x7fb0ae721200 <block_mig_state>) at block-migration.c:860
#1  0x00007fb0adf4a13a in qemu_savevm_state_pending (f=0x7fb0b01e3a40, max_size=max_size@entry=0,
    res_non_postcopiable=res_non_postcopiable@entry=0x7fb09d604c90, res_postcopiable=res_postcopiable@entry=0x7fb09d604c88)
    at /home/hook/src/qemu/postcopy2/savevm.c:983
#2  0x00007fb0ae01bd82 in migration_thread (opaque=0x7fb0ae684420 <current_migration>) at migration.c:1185
#3  0x00007fb0a824d182 in start_thread (arg=0x7fb09d605700) at pthread_create.c:312
#4  0x00007fb0a7f79fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

Q: why is max_size == 0? Does this seem correct?

We look at se->ops:

(gdb) print *se->ops
$9 = {set_params = 0x7fb0ae028820 <block_set_params>, save_state = 0x0,
  cancel = 0x7fb0ae028f50 <block_migration_cancel>,
  save_live_complete = 0x7fb0ae0299a0 <block_save_complete>,
  is_active = 0x7fb0ae028870 <block_is_active>,
  save_live_iterate = 0x7fb0ae029480 <block_save_iterate>,
  save_live_setup = 0x7fb0ae029330 <block_save_setup>,
  save_live_pending = 0x7fb0ae028b30 <block_save_pending>,
  can_postcopy = 0x0, load_state = 0x7fb0ae0288b0 <block_load>}

Why is can_postcopy() NULL?

(gdb) n
qemu_savevm_state_pending (f=0x7fb0b01e3a40, max_size=max_size@entry=0,
    res_non_postcopiable=res_non_postcopiable@entry=0x7fb09d604c90, res_postcopiable=res_postcopiable@entry=0x7fb09d604c88)
    at /home/hook/src/qemu/postcopy2/savevm.c:989
989	        if (se->ops->can_postcopy(se->opaque)) {
(gdb) print *se
$14 = {entry = {tqe_next = 0x7fb0aff9ab30, tqe_prev = 0x7fb0aff88f20},
  idstr = "block", '\000' <repeats 250 times>, instance_id = 0,
  alias_id = 0, version_id = 1, section_id = 1,
  ops = 0x7fb0ae6848e0 <savevm_block_handlers>, vmsd = 0x0,
  opaque = 0x7fb0ae721200 <block_mig_state>, compat = 0x0, is_ram = 1}
(gdb) step

Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) 


The patches appear to have been fully applied, but it would seem that the
savevm_block_handlers structure needs to be updated to populate this
field? Which implies that a new function will have to be written?

Or, if I have missed the obvious, I would appreciate enlightenment.
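
For illustration, populating the field could look roughly like the sketch below. This is not code from the patch series: the Demo* types, the demo_* function names, and the choice to have the block handler return false are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down stand-in for the handler table seen in gdb. */
typedef struct DemoHandlers {
    bool (*is_active)(void *opaque);
    bool (*can_postcopy)(void *opaque);
} DemoHandlers;

static bool demo_block_is_active(void *opaque)
{
    (void)opaque;
    return true;
}

/* Assumption: a block section simply reports that it cannot be postcopied. */
static bool demo_block_can_postcopy(void *opaque)
{
    (void)opaque;
    return false;
}

/* Every field is populated, so callers may invoke can_postcopy directly. */
static const DemoHandlers demo_block_handlers = {
    .is_active = demo_block_is_active,
    .can_postcopy = demo_block_can_postcopy,
};

/* Caller that calls the handler unconditionally, like the crashing line. */
static bool demo_section_is_postcopiable(const DemoHandlers *ops, void *opaque)
{
    /* This approach only works if every table fills the field in. */
    assert(ops != NULL && ops->can_postcopy != NULL);
    return ops->can_postcopy(opaque);
}
```

The cost of this approach is that every handler table has to be touched; a table that is forgotten still crashes (or trips the assertion) at the call site.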

Thanks,
Gary


* Re: [Qemu-devel] Bug in recent postcopy patch
  2014-10-29 22:27 [Qemu-devel] Bug in recent postcopy patch Gary Hook
@ 2014-10-30 10:03 ` Dr. David Alan Gilbert
  2014-10-30 16:49   ` Gary Hook
  0 siblings, 1 reply; 6+ messages in thread
From: Dr. David Alan Gilbert @ 2014-10-30 10:03 UTC
  To: Gary Hook; +Cc: qemu-devel@nongnu.org

* Gary Hook (gary.hook@nimboxx.com) wrote:
> *Knock* *knock* *knock* Is this thing on?

Yes - but only by luck did I notice this; it's normally better
to reply to the thread that posted a patch and cc the authors!

> I applied the 47 pieces of the recent postcopy patch to 2.1.2 and am
> poking around. An attempt to migrate results in a NULL pointer dereference
> in savevm.c.  Here is info from gdb:

I've not tried migrating with block migration; so can you
show the command line you used on qemu and the sequence of commands
you used to trigger the migration?

> Most of qemu_savevm_state_pending() succeeds, until it gets to the end.
> Here's the relevant thread while calling is_active():
> 
> (gdb) backtrace
> #0  block_is_active (opaque=0x7fb0ae721200 <block_mig_state>) at
> block-migration.c:860
> #1  0x00007fb0adf4a13a in qemu_savevm_state_pending (f=0x7fb0b01e3a40,
> max_size=max_size@entry=0,
>     res_non_postcopiable=res_non_postcopiable@entry=0x7fb09d604c90,
> res_postcopiable=res_postcopiable@entry=0x7fb09d604c88)
>     at /home/hook/src/qemu/postcopy2/savevm.c:983
> #2  0x00007fb0ae01bd82 in migration_thread (opaque=0x7fb0ae684420
> <current_migration>) at migration.c:1185
> #3  0x00007fb0a824d182 in start_thread (arg=0x7fb09d605700) at
> pthread_create.c:312
> #4  0x00007fb0a7f79fbd in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
> 
> Q: why is max_size == 0? Does this seem correct?

Yes, I think that's normal for the first time through the loop (see near
the start of migration_thread, where max_size is initialised to 0).
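
To illustrate why a zero max_size on the first pass is harmless, here is a simplified model of such a loop. It is not QEMU's migration_thread: the struct, the function names, the bandwidth formula, and the iterate-while-pending-exceeds-threshold rule are all assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the per-migration loop state. */
typedef struct DemoMigLoop {
    /* Bytes we estimate could be sent within the allowed downtime.
     * Stays 0 until the first bandwidth measurement is available. */
    uint64_t max_size;
} DemoMigLoop;

static void demo_loop_init(DemoMigLoop *l)
{
    l->max_size = 0;
}

/* Re-estimate the threshold after a measurement window (assumed formula). */
static void demo_loop_update(DemoMigLoop *l, uint64_t bytes_sent,
                             uint64_t elapsed_ms, uint64_t max_downtime_ms)
{
    assert(elapsed_ms > 0);
    l->max_size = bytes_sent / elapsed_ms * max_downtime_ms;
}

/* Keep iterating while there is more pending data than fits the budget. */
static bool demo_should_iterate(const DemoMigLoop *l, uint64_t pending)
{
    return pending != 0 && pending >= l->max_size;
}
```

In this model, while max_size is still 0 any non-zero pending amount keeps the loop iterating; once a bandwidth estimate exists, a small remainder falls below the threshold and the loop can move on to completion.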

> We look at se->ops:
> 
> (gdb) print *se->ops
> $9 = {set_params = 0x7fb0ae028820 <block_set_params>, save_state = 0x0,
> cancel = 0x7fb0ae028f50 <block_migration_cancel>,
>   save_live_complete = 0x7fb0ae0299a0 <block_save_complete>, is_active =
> 0x7fb0ae028870 <block_is_active>,
>   save_live_iterate = 0x7fb0ae029480 <block_save_iterate>, save_live_setup
> = 0x7fb0ae029330 <block_save_setup>,
>   save_live_pending = 0x7fb0ae028b30 <block_save_pending>, can_postcopy =
> 0x0, load_state = 0x7fb0ae0288b0 <block_load>}
> 
> Why is can_postcopy() NULL?
> 
> (gdb) n
> qemu_savevm_state_pending (f=0x7fb0b01e3a40, max_size=max_size@entry=0,
> res_non_postcopiable=res_non_postcopiable@entry=0x7fb09d604c90,
>     res_postcopiable=res_postcopiable@entry=0x7fb09d604c88) at
> /home/hook/src/qemu/postcopy2/savevm.c:989
> 989	        if (se->ops->can_postcopy(se->opaque)) {
> (gdb) print *se
> $14 = {entry = {tqe_next = 0x7fb0aff9ab30, tqe_prev = 0x7fb0aff88f20},
> idstr = "block", '\000' <repeats 250 times>, instance_id = 0,
>   alias_id = 0, version_id = 1, section_id = 1, ops = 0x7fb0ae6848e0
> <savevm_block_handlers>, vmsd = 0x0,
>   opaque = 0x7fb0ae721200 <block_mig_state>, compat = 0x0, is_ram = 1}
> (gdb) step
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x0000000000000000 in ?? ()
> (gdb) 
> 
> 
> The patches appear to have been fully applied, but it would seem that the
> savevm_block_handlers structure needs to be updated to populate this
> field? Which implies that a new function will have to be written?
> 
> Or, if I have missed the obvious, I would appreciate enlightenment.

Simple bug on my part; the line:

        if (se->ops->can_postcopy(se->opaque)) {

needs to become:
        if (se->ops->can_postcopy &&
            se->ops->can_postcopy(se->opaque)) {
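
As a standalone illustration of that guard, here is a minimal sketch of an optional callback in a handler table. The Demo* types and demo_* names are hypothetical stand-ins, not the real QEMU structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down handler table: can_postcopy is optional. */
typedef struct DemoHandlers {
    bool (*can_postcopy)(void *opaque);   /* may be NULL */
} DemoHandlers;

typedef struct DemoEntry {
    const DemoHandlers *ops;
    void *opaque;
} DemoEntry;

/* True only when the handler exists and reports true.
 * The && short-circuits, so a NULL pointer is never called. */
static bool demo_entry_can_postcopy(const DemoEntry *se)
{
    assert(se != NULL && se->ops != NULL);
    return se->ops->can_postcopy && se->ops->can_postcopy(se->opaque);
}

static bool demo_always_true(void *opaque)
{
    (void)opaque;
    return true;
}

/* A table that leaves the handler unset, like the one in the gdb dump. */
static const DemoHandlers demo_block_like_handlers = { .can_postcopy = NULL };

/* A table that provides the handler. */
static const DemoHandlers demo_postcopiable_handlers = {
    .can_postcopy = demo_always_true,
};
```

With this shape, an entry whose table omits the handler is treated as not postcopiable instead of jumping to address 0.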

Thanks for the report.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] Bug in recent postcopy patch
  2014-10-30 10:03 ` Dr. David Alan Gilbert
@ 2014-10-30 16:49   ` Gary Hook
  2014-10-30 20:08     ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 6+ messages in thread
From: Gary Hook @ 2014-10-30 16:49 UTC
  To: qemu-devel@nongnu.org; +Cc: Dr. David Alan Gilbert

On 10/30/14, 5:03 AM, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

>* Gary Hook (gary.hook@nimboxx.com) wrote:
>> *Knock* *knock* *knock* Is this thing on?
>
>Yes - but only by luck did I notice this; it's normally better
>to reply to the thread that posted a patch and cc the authors!

Well, that depends upon the developers, I think. I was gently admonished
on another list for addressing a developer (inadvertently) directly. But I
appreciate your openness, and would not want to abuse your attention.

>> I applied the 47 pieces of the recent postcopy patch to 2.1.2 and am
>> poking around. An attempt to migrate results in a NULL pointer dereference
>> in savevm.c.  Here is info from gdb:
>
>I've not tried migrating with block migration; so can you
>show the command line you used on qemu and the sequence of commands
>you used to trigger the migration?

Yessir.  We invoke the emulator from libvirt. While the problem we are
dealing with applies to any VM, the one I am working with is invoked
thusly (edited for readability):

qemu-system-x86_64 -enable-kvm -name 88dbaf46-4692-4935-bd9d-8d8fac7725a9 \
	-S -machine pc-0.14,accel=kvm,usb=off -m 1024 -realtime mlock=off \
	-smp 1,sockets=1,cores=1,threads=1 \
	-uuid 88dbaf46-4692-4935-bd9d-8d8fac7725a9 -no-user-config -nodefaults \
	-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/88dbaf46-4692-4935-bd9d-8d8fac7725a9.monitor,server,nowait \
	-mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime \
	-no-shutdown -boot strict=on \
	-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
	-drive file=/mnt/store01/virt/88dbaf46-4692-4935-bd9d-8d8fac7725a9.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=writeback \
	-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
	-drive if=none,id=drive-ide0-1-0,readonly=on,format=raw \
	-device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=2 \
	-netdev tap,fd=29,id=hostnet0 \
	-device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:07:19:5e,bus=pci.0,addr=0x3 \
	-chardev pty,id=charserial0 \
	-device isa-serial,chardev=charserial0,id=serial0 \
	-vnc 127.0.0.1:0,password -device VGA,id=video0,bus=pci.0,addr=0x2 \
	-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
	-msg timestamp=on

I posted another thread asking about migration failure due to a copy
taking too long, but got no traction. In the case where the problem raises
its head we have turned tunneling on. A tiny VM (<2GB in size) migrates
fine using the same procedure. Again, no shared storage.



>> Q: why is max_size == 0? Does this seem correct?
>
>Yes, I think that's normal for the first time through the loop (see near
>the start of migration_thread, where max_size is initialised to 0).

Thank you; will do.

>> 
>> 
>> The patches appear to have been fully applied, but it would seem that the
>> savevm_block_handlers structure needs to be updated to populate this
>> field? Which implies that a new function will have to be written?
>> 
>> Or, if I have missed the obvious, I would appreciate enlightenment.
>
>Simple bug on my part; the line:
>
>        if (se->ops->can_postcopy(se->opaque)) {
>
>needs to become:
>        if (se->ops->can_postcopy &&
>            se->ops->can_postcopy(se->opaque)) {

I wondered if that were not the case. I will make that change and see what
happens.

>Thanks for the report.

Thank you for your time and ownership.

Gary


* Re: [Qemu-devel] Bug in recent postcopy patch
  2014-10-30 16:49   ` Gary Hook
@ 2014-10-30 20:08     ` Dr. David Alan Gilbert
  2014-10-30 21:59       ` Gary Hook
  0 siblings, 1 reply; 6+ messages in thread
From: Dr. David Alan Gilbert @ 2014-10-30 20:08 UTC
  To: Gary Hook; +Cc: qemu-devel@nongnu.org

* Gary Hook (gary.hook@nimboxx.com) wrote:
> On 10/30/14, 5:03 AM, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> >* Gary Hook (gary.hook@nimboxx.com) wrote:
> >> *Knock* *knock* *knock* Is this thing on?
> >
> >Yes - but only by luck did I notice this; it's normally better
> >to reply to the thread that posted a patch and cc the authors!
> 
> Well, that depends upon the developers, I think. I was gently admonished
> on another list for addressing a developer (inadvertently) directly. But I
> appreciate your openness, and would not want to abuse your attention.
> 
> >> I applied the 47 pieces of the recent postcopy patch to 2.1.2 and am
> >> poking around. An attempt to migrate results in a NULL pointer dereference
> >> in savevm.c.  Here is info from gdb:
> >
> >I've not tried migrating with block migration; so can you
> >show the command line you used on qemu and the sequence of commands
> >you used to trigger the migration?
> 
> Yessir.  We invoke the emulator from libvirt. While the problem we are
> dealing with applies to any VM, the one I am working with is invoked
> thusly (edited for readability):
> 
> qemu-system-x86_64 -enable-kvm -name 88dbaf46-4692-4935-bd9d-8d8fac7725a9 \
> 	-S -machine pc-0.14,accel=kvm,usb=off -m 1024 -realtime mlock=off \
> 	-smp 1,sockets=1,cores=1,threads=1 \
> 	-uuid 88dbaf46-4692-4935-bd9d-8d8fac7725a9 -no-user-config -nodefaults \
> 	-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/88dbaf46-4692-4935-bd9d-8d8fac7725a9.monitor,server,nowait \
> 	-mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime \
> 	-no-shutdown -boot strict=on \
> 	-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
> 	-drive file=/mnt/store01/virt/88dbaf46-4692-4935-bd9d-8d8fac7725a9.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=writeback \
> 	-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
> 	-drive if=none,id=drive-ide0-1-0,readonly=on,format=raw \
> 	-device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=2 \
> 	-netdev tap,fd=29,id=hostnet0 \
> 	-device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:07:19:5e,bus=pci.0,addr=0x3 \
> 	-chardev pty,id=charserial0 \
> 	-device isa-serial,chardev=charserial0,id=serial0 \
> 	-vnc 127.0.0.1:0,password -device VGA,id=video0,bus=pci.0,addr=0x2 \
> 	-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
> 	-msg timestamp=on
> 
> I posted another thread asking about migration failure due to a copy
> taking too long, but got no traction. In the case where the problem raises
> its head we have turned tunneling on. A tiny VM (<2GB in size) migrates
> fine using the same procedure. Again, no shared storage.

Is the guest that doesn't migrate idle or is it busily changing lots of memory?

> >> Q: why is max_size == 0? Does this seem correct?
> >
> >Yes, I think that's normal for the first time through the loop (see near
> >the start of migration_thread, where max_size is initialised to 0).
> 
> Thank you; will do.
> 
> >> 
> >> 
> >> The patches appear to have been fully applied, but it would seem that the
> >> savevm_block_handlers structure needs to be updated to populate this
> >> field? Which implies that a new function will have to be written?
> >> 
> >> Or, if I have missed the obvious, I would appreciate enlightenment.
> >
> >Simple bug on my part; the line:
> >
> >        if (se->ops->can_postcopy(se->opaque)) {
> >
> >needs to become:
> >        if (se->ops->can_postcopy &&
> >            se->ops->can_postcopy(se->opaque)) {
> 
> I wondered if that were not the case. I will make that change and see what
> happens.
> 
> >Thanks for the report.
> 
> Thank you for your time and ownership.

No problem; note the postcopy code is still quite young, so don't
be too surprised if you hit other issues.

Dave

> 
> Gary
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK


* Re: [Qemu-devel] Bug in recent postcopy patch
  2014-10-30 20:08     ` Dr. David Alan Gilbert
@ 2014-10-30 21:59       ` Gary Hook
  2014-10-31 12:04         ` Dr. David Alan Gilbert
  0 siblings, 1 reply; 6+ messages in thread
From: Gary Hook @ 2014-10-30 21:59 UTC
  To: qemu-devel@nongnu.org; +Cc: Dr. David Alan Gilbert

On 10/30/14, 3:08 PM, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:

>> I posted another thread asking about migration failure due to a copy
>> taking too long, but got no traction. In the case where the problem raises
>> its head we have turned tunneling on. A tiny VM (<2GB in size) migrates
>> fine using the same procedure. Again, no shared storage.
>
>Is the guest that doesn't migrate idle or is it busily changing lots of memory?

Quite idle.  Boot the VM, no need to start a workload, try to migrate.
Failure.

Also, very large VMs will fail to migrate (non-tunneled). This _seems_ to
also be related to the amount of time required to copy everything from A
to B.

Again, tunneling seems to more quickly expose this issue as it increases
the amount of time required to copy the qcow2 file over the network.

I will add here that I've watched the qcow2 file grow, made a copy of it
(on the receiving end) before it gets deleted, and been able to start a VM
using the file. It would seem to be copasetic.

I need to add tracing code to the emulator, in a way that doesn't rely
upon command line options or environment variables. I don't see any such
facility at this point. Specifically, I want to begin by watching what is
going through the monitor (i.e., return values from qemu-system-x86_64 and
why there are complaints). Unless you have a clear explanation as to why
the emulator is throwing an error, could you suggest any areas I may want
to focus my efforts on?
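
For illustration, one way such always-on tracing might be wired in is sketched below: a helper that appends to a log path compiled into the binary, so no command line option or environment variable is involved. The DEMO_TRACE macro, the demo_trace_to function, and the log path are all hypothetical and not part of QEMU.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical: log path fixed at compile time. */
#define DEMO_TRACE_PATH "/tmp/qemu-demo-trace.log"

/* Append one timestamped line to 'path'.
 * Returns the number of characters written for the prefix and message,
 * or -1 if the file cannot be opened. */
static int demo_trace_to(const char *path, const char *func, int line,
                         const char *fmt, ...)
{
    FILE *fp;
    va_list ap;
    int n;

    assert(path != NULL && fmt != NULL);
    fp = fopen(path, "a");
    if (!fp) {
        return -1;
    }
    n = fprintf(fp, "[%ld] %s:%d: ", (long)time(NULL), func, line);
    va_start(ap, fmt);
    n += vfprintf(fp, fmt, ap);
    va_end(ap);
    fputc('\n', fp);
    fclose(fp);   /* open and close per call: slow, but survives a crash */
    return n;
}

/* Convenience wrapper that records the calling function and line. */
#define DEMO_TRACE(...) \
    demo_trace_to(DEMO_TRACE_PATH, __func__, __LINE__, __VA_ARGS__)
```

A call such as DEMO_TRACE("migrate returned %d", ret) could then be dropped next to whichever return value is of interest. Opening the file on every call is deliberately simple rather than fast.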

>> 
>> >Thanks for the report.
>> 
>> Thank you for your time and ownership.
>
>No problem; note the postcopy code is still quite young, so don't
>be too surprised if you hit other issues.

Of course; it's fresh out of the oven. But the migration of VMs using
non-shared storage is not (tunneled or otherwise), and that's really what
I am focused on.

Again, much appreciation.

Gary


* Re: [Qemu-devel] Bug in recent postcopy patch
  2014-10-30 21:59       ` Gary Hook
@ 2014-10-31 12:04         ` Dr. David Alan Gilbert
  0 siblings, 0 replies; 6+ messages in thread
From: Dr. David Alan Gilbert @ 2014-10-31 12:04 UTC
  To: Gary Hook; +Cc: qemu-devel@nongnu.org

* Gary Hook (gary.hook@nimboxx.com) wrote:
> 
> 
> On 10/30/14, 3:08 PM, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
> 
> >> I posted another thread asking about migration failure due to a copy
> >> taking too long, but got no traction. In the case where the problem raises
> >> its head we have turned tunneling on. A tiny VM (<2GB in size) migrates
> >> fine using the same procedure. Again, no shared storage.
> >
> >Is the guest that doesn't migrate idle or is it busily changing lots of memory?
> 
> Quite idle.  Boot the VM, no need to start a workload, try to migrate.
> Failure.
> 
> Also, very large VMs will fail to migrate (non-tunneled). This _seems_ to
> also be related to the amount of time required to copy everything from A
> to B.
> 
> Again, tunneling seems to more quickly expose this issue as it increases
> the amount of time required to copy the qcow2 file over the network.
> 
> I will add here that I've watched the qcow2 file grow, made a copy of it
> (on the receiving end) before it gets deleted, and been able to start a VM
> using the file. It would seem to be copasetic.
> 
> I need to add tracing code to the emulator, in a way that doesn't rely
> upon command line options or environment variables. I don't see any such
> facility at this point. Specifically, I want to begin by watching what is
> going through the monitor (i.e., return values from qemu-system-x86_64 and
> why there are complaints). Unless you have a clear explanation as to why
> the emulator is throwing an error, could you suggest any areas I may want
> to focus my efforts on?

No I don't, but there again I've not done any block stuff, and it sounds like
your problem is mostly related to moving the image file (which I thought
libvirt preferred to do using NBD underneath now, but again, I'm not a block
guy).

> >> >Thanks for the report.
> >> 
> >> Thank you for your time and ownership.
> >
> >No problem; note the postcopy code is still quite young, so don't
> >be too surprised if you hit other issues.
> 
> Of course; it's fresh out of the oven. But the migration of VMs using
> non-shared storage is not (tunneled or otherwise), and that's really what
> I am focused on.
> 
> Again, much appreciation.

Dave

> Gary
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
