* [Qemu-devel] Bug in recent postcopy patch
@ 2014-10-29 22:27 Gary Hook
2014-10-30 10:03 ` Dr. David Alan Gilbert
0 siblings, 1 reply; 6+ messages in thread
From: Gary Hook @ 2014-10-29 22:27 UTC (permalink / raw)
To: qemu-devel@nongnu.org
*Knock* *knock* *knock* Is this thing on?
I applied the 47 pieces of the recent postcopy patch to 2.1.2 and am
poking around. An attempt to migrate results in a NULL pointer dereference
in savevm.c. Here is info from gdb:
Most of qemu_savevm_state_pending() succeeds, until it gets to the end.
Here's the relevant thread while calling is_active():
(gdb) backtrace
#0 block_is_active (opaque=0x7fb0ae721200 <block_mig_state>) at
block-migration.c:860
#1 0x00007fb0adf4a13a in qemu_savevm_state_pending (f=0x7fb0b01e3a40,
max_size=max_size@entry=0,
res_non_postcopiable=res_non_postcopiable@entry=0x7fb09d604c90,
res_postcopiable=res_postcopiable@entry=0x7fb09d604c88)
at /home/hook/src/qemu/postcopy2/savevm.c:983
#2 0x00007fb0ae01bd82 in migration_thread (opaque=0x7fb0ae684420
<current_migration>) at migration.c:1185
#3 0x00007fb0a824d182 in start_thread (arg=0x7fb09d605700) at
pthread_create.c:312
#4 0x00007fb0a7f79fbd in clone () at
../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Q: why is max_size == 0? Does this seem correct?
We look at se->ops:
(gdb) print *se->ops
$9 = {set_params = 0x7fb0ae028820 <block_set_params>, save_state = 0x0,
cancel = 0x7fb0ae028f50 <block_migration_cancel>,
save_live_complete = 0x7fb0ae0299a0 <block_save_complete>, is_active =
0x7fb0ae028870 <block_is_active>,
save_live_iterate = 0x7fb0ae029480 <block_save_iterate>, save_live_setup
= 0x7fb0ae029330 <block_save_setup>,
save_live_pending = 0x7fb0ae028b30 <block_save_pending>, can_postcopy =
0x0, load_state = 0x7fb0ae0288b0 <block_load>}
Why is can_postcopy() NULL?
(gdb) n
qemu_savevm_state_pending (f=0x7fb0b01e3a40, max_size=max_size@entry=0,
res_non_postcopiable=res_non_postcopiable@entry=0x7fb09d604c90,
res_postcopiable=res_postcopiable@entry=0x7fb09d604c88) at
/home/hook/src/qemu/postcopy2/savevm.c:989
989 if (se->ops->can_postcopy(se->opaque)) {
(gdb) print *se
$14 = {entry = {tqe_next = 0x7fb0aff9ab30, tqe_prev = 0x7fb0aff88f20},
idstr = "block", '\000' <repeats 250 times>, instance_id = 0,
alias_id = 0, version_id = 1, section_id = 1, ops = 0x7fb0ae6848e0
<savevm_block_handlers>, vmsd = 0x0,
opaque = 0x7fb0ae721200 <block_mig_state>, compat = 0x0, is_ram = 1}
(gdb) step
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb)
The patches appear to have been fully applied, but it would seem that the
savevm_block_handlers structure needs to be updated to populate this
field? Which implies that a new function will have to be written?
Or, if I have missed the obvious, I would appreciate enlightenment.
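
For illustration, the option raised above (populating the field with a new function) could be sketched as follows. This is a minimal, self-contained sketch and not QEMU code: `SketchOps`, `sketch_block_can_postcopy`, `sketch_block_handlers` and `sketch_query` are hypothetical stand-ins, and it assumes block migration should simply report that it cannot be postcopied, which the thread does not settle.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, cut-down handler table with only the field in question. */
typedef struct SketchOps {
    bool (*can_postcopy)(void *opaque);
} SketchOps;

/* Hypothetical stub: assumes block state never supports postcopy. */
static bool sketch_block_can_postcopy(void *opaque)
{
    (void)opaque;
    return false;
}

/* The table now populates the field instead of leaving it NULL. */
static const SketchOps sketch_block_handlers = {
    .can_postcopy = sketch_block_can_postcopy,
};

/*
 * Unguarded call, in the same shape as the crashing line. It is only
 * safe when every table fills in can_postcopy; the assert documents
 * that precondition.
 */
static bool sketch_query(const SketchOps *ops, void *opaque)
{
    assert(ops != NULL && ops->can_postcopy != NULL);
    return ops->can_postcopy(opaque);
}
```

With this approach every handler table has to supply the callback, so any table that omits it would still crash at the unguarded call site.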
Thanks,
Gary
* Re: [Qemu-devel] Bug in recent postcopy patch
2014-10-29 22:27 [Qemu-devel] Bug in recent postcopy patch Gary Hook
@ 2014-10-30 10:03 ` Dr. David Alan Gilbert
2014-10-30 16:49 ` Gary Hook
0 siblings, 1 reply; 6+ messages in thread
From: Dr. David Alan Gilbert @ 2014-10-30 10:03 UTC (permalink / raw)
To: Gary Hook; +Cc: qemu-devel@nongnu.org
* Gary Hook (gary.hook@nimboxx.com) wrote:
> *Knock* *knock* *knock* Is this thing on?
Yes - but only by luck did I notice this; it's normally better
to reply to the thread that posted a patch and cc the authors!
> I applied the 47 pieces of the recent postcopy patch to 2.1.2 and am
> poking around. An attempt to migrate results in a NULL pointer dereference
> in savevm.c. Here is info from gdb:
I've not tried migrating with block migration; so can you
show the command line you used on qemu and the sequence of commands
you used to trigger the migration?
> Most of qemu_savevm_state_pending() succeeds, until it gets to the end.
> Here's the relevant thread while calling is_active():
>
> (gdb) backtrace
> #0 block_is_active (opaque=0x7fb0ae721200 <block_mig_state>) at
> block-migration.c:860
> #1 0x00007fb0adf4a13a in qemu_savevm_state_pending (f=0x7fb0b01e3a40,
> max_size=max_size@entry=0,
> res_non_postcopiable=res_non_postcopiable@entry=0x7fb09d604c90,
> res_postcopiable=res_postcopiable@entry=0x7fb09d604c88)
> at /home/hook/src/qemu/postcopy2/savevm.c:983
> #2 0x00007fb0ae01bd82 in migration_thread (opaque=0x7fb0ae684420
> <current_migration>) at migration.c:1185
> #3 0x00007fb0a824d182 in start_thread (arg=0x7fb09d605700) at
> pthread_create.c:312
> #4 0x00007fb0a7f79fbd in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
>
> Q: why is max_size == 0? Does this seem correct?
Yes, I think that's normal for the first pass through the loop (see
migration_thread: near the start, max_size is initialised to 0).
> We look at se->ops:
>
> (gdb) print *se->ops
> $9 = {set_params = 0x7fb0ae028820 <block_set_params>, save_state = 0x0,
> cancel = 0x7fb0ae028f50 <block_migration_cancel>,
> save_live_complete = 0x7fb0ae0299a0 <block_save_complete>, is_active =
> 0x7fb0ae028870 <block_is_active>,
> save_live_iterate = 0x7fb0ae029480 <block_save_iterate>, save_live_setup
> = 0x7fb0ae029330 <block_save_setup>,
> save_live_pending = 0x7fb0ae028b30 <block_save_pending>, can_postcopy =
> 0x0, load_state = 0x7fb0ae0288b0 <block_load>}
>
> Why is can_postcopy() NULL?
>
> (gdb) n
> qemu_savevm_state_pending (f=0x7fb0b01e3a40, max_size=max_size@entry=0,
> res_non_postcopiable=res_non_postcopiable@entry=0x7fb09d604c90,
> res_postcopiable=res_postcopiable@entry=0x7fb09d604c88) at
> /home/hook/src/qemu/postcopy2/savevm.c:989
> 989 if (se->ops->can_postcopy(se->opaque)) {
> (gdb) print *se
> $14 = {entry = {tqe_next = 0x7fb0aff9ab30, tqe_prev = 0x7fb0aff88f20},
> idstr = "block", '\000' <repeats 250 times>, instance_id = 0,
> alias_id = 0, version_id = 1, section_id = 1, ops = 0x7fb0ae6848e0
> <savevm_block_handlers>, vmsd = 0x0,
> opaque = 0x7fb0ae721200 <block_mig_state>, compat = 0x0, is_ram = 1}
> (gdb) step
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x0000000000000000 in ?? ()
> (gdb)
>
>
> The patches appear to have been fully applied, but it would seem that the
> savevm_block_handlers structure needs to be updated to populate this
> field? Which implies that a new function will have to be written?
>
> Or, if I have missed the obvious, I would appreciate enlightenment.
Simple bug on my part; the line:
if (se->ops->can_postcopy(se->opaque)) {
needs to become:
if (se->ops->can_postcopy &&
se->ops->can_postcopy(se->opaque)) {
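
The guard above can be sketched in isolation as follows. This is a minimal, self-contained illustration and not the actual savevm.c code: the `Demo*` types and `demo_*` names are hypothetical, simplified stand-ins for the handler table and state entry shown in the gdb output, and treating a missing callback as "cannot postcopy" is the assumption the fix relies on.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the handler table. */
typedef struct DemoVMStateOps {
    bool (*is_active)(void *opaque);    /* optional callback */
    bool (*can_postcopy)(void *opaque); /* optional callback, may be NULL */
} DemoVMStateOps;

/* Hypothetical, simplified stand-in for a registered state entry. */
typedef struct DemoStateEntry {
    const DemoVMStateOps *ops;
    void *opaque;
} DemoStateEntry;

/*
 * True only when the callback exists and reports postcopy support.
 * The && short-circuits, so a NULL pointer is never called.
 */
static bool demo_entry_can_postcopy(const DemoStateEntry *se)
{
    assert(se != NULL && se->ops != NULL);
    return se->ops->can_postcopy && se->ops->can_postcopy(se->opaque);
}

/* Example callback for a section that does support postcopy. */
static bool demo_always_yes(void *opaque)
{
    (void)opaque;
    return true;
}

/* Like the "block" handlers in the report: can_postcopy left unset. */
static const DemoVMStateOps demo_block_ops = { .can_postcopy = NULL };
/* A table that does provide the callback. */
static const DemoVMStateOps demo_ram_ops = { .can_postcopy = demo_always_yes };
```

An entry using `demo_block_ops` is then reported as not postcopiable instead of jumping to address 0, while an entry using `demo_ram_ops` still has its callback consulted.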
Thanks for the report.
Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: [Qemu-devel] Bug in recent postcopy patch
2014-10-30 10:03 ` Dr. David Alan Gilbert
@ 2014-10-30 16:49 ` Gary Hook
2014-10-30 20:08 ` Dr. David Alan Gilbert
0 siblings, 1 reply; 6+ messages in thread
From: Gary Hook @ 2014-10-30 16:49 UTC (permalink / raw)
To: qemu-devel@nongnu.org; +Cc: Dr. David Alan Gilbert
On 10/30/14, 5:03 AM, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>* Gary Hook (gary.hook@nimboxx.com) wrote:
>> *Knock* *knock* *knock* Is this thing on?
>
>Yes - but only by luck did I notice this; it's normally better
>to reply to the thread that posted a patch and cc the authors!
Well, that depends upon the developers, I think. I was gently admonished
on another list for addressing a developer (inadvertently) directly. But I
appreciate your openness, and would not want to abuse your attention.
>> I applied the 47 pieces of the recent postcopy patch to 2.1.2 and am
>> poking around. An attempt to migrate results in a NULL pointer
>>dereference
>> in savevm.c. Here is info from gdb:
>
>I've not tried migrating with block migration; so can you
>show the command line you used on qemu and the sequence of commands
>you used to trigger the migration?
Yessir. We invoke the emulator from libvirt. While the problem we are
dealing with applies to any VM, the one I am working with is invoked
thusly (edited for readability):
qemu-system-x86_64 -enable-kvm -name 88dbaf46-4692-4935-bd9d-8d8fac7725a9 \
-S -machine pc-0.14,accel=kvm,usb=off -m 1024 -realtime mlock=off \
-smp 1,sockets=1,cores=1,threads=1 \
-uuid 88dbaf46-4692-4935-bd9d-8d8fac7725a9 -no-user-config -nodefaults \
-chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/88dbaf46-4692-4935-bd9d-8d8fac7725a9.monitor,server,nowait \
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime \
-no-shutdown -boot strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-drive file=/mnt/store01/virt/88dbaf46-4692-4935-bd9d-8d8fac7725a9.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=writeback \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 \
-drive if=none,id=drive-ide0-1-0,readonly=on,format=raw \
-device ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=2 \
-netdev tap,fd=29,id=hostnet0 \
-device rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:07:19:5e,bus=pci.0,addr=0x3 \
-chardev pty,id=charserial0 \
-device isa-serial,chardev=charserial0,id=serial0 \
-vnc 127.0.0.1:0,password -device VGA,id=video0,bus=pci.0,addr=0x2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
-msg timestamp=on
I posted another thread asking about migration failure due to a copy
taking too long, but got no traction. In the case where the problem raises
its head we have turned tunneling on. A tiny VM (<2GB in size) migrates
fine using the same procedure. Again, no shared storage.
>>Q: why is max_size == 0? Does this seem correct?
>
>Yes, I think that's normal for the 1st time through the loop; (see
>migration_thread
>near the start max_size is initialised to 0).
Thank you; will do.
>>
>>
>> The patches appear to have been fully applied, but it would seem that
>>the
>> savevm_block_handlers structure needs to be updated to populate this
>> field? Which implies that a new function will have to be written?
>>
>> Or, if I have missed the obvious, I would appreciate enlightenment.
>
>Simple bug on my part; the line:
>
> if (se->ops->can_postcopy(se->opaque)) {
>
>needs to become:
> if (se->ops->can_postcopy &&
> se->ops->can_postcopy(se->opaque)) {
I wondered if that were not the case. I will make that change and see what
happens.
>Thanks for the report.
Thank you for your time and ownership.
Gary
* Re: [Qemu-devel] Bug in recent postcopy patch
2014-10-30 16:49 ` Gary Hook
@ 2014-10-30 20:08 ` Dr. David Alan Gilbert
2014-10-30 21:59 ` Gary Hook
0 siblings, 1 reply; 6+ messages in thread
From: Dr. David Alan Gilbert @ 2014-10-30 20:08 UTC (permalink / raw)
To: Gary Hook; +Cc: qemu-devel@nongnu.org
* Gary Hook (gary.hook@nimboxx.com) wrote:
> On 10/30/14, 5:03 AM, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>
> >* Gary Hook (gary.hook@nimboxx.com) wrote:
> >> *Knock* *knock* *knock* Is this thing on?
> >
> >Yes - but only by luck did I notice this; it's normally better
> >to reply to the thread that posted a patch and cc the authors!
>
> Well, that depends upon the developers, I think. I was gently admonished
> on another list for addressing a developer (inadvertently) directly. But I
> appreciate your openness, and would not want to abuse your attention.
>
> >> I applied the 47 pieces of the recent postcopy patch to 2.1.2 and am
> >> poking around. An attempt to migrate results in a NULL pointer
> >>dereference
> >> in savevm.c. Here is info from gdb:
> >
> >I've not tried migrating with block migration; so can you
> >show the command line you used on qemu and the sequence of commands
> >you used to trigger the migration?
>
> Yessir. We invoke the emulator from libvirt. While the problem we are
> dealing with applies to any VM, the one I am working with is invoked
> thusly (edited for readability):
>
> qemu-system-x86_64 -enable-kvm -name 88dbaf46-4692-4935-bd9d-8d8fac7725a9 \
> -S -machine pc-0.14,accel=kvm,usb=off -m 1024 -realtime mlock=off \
> -smp 1,sockets=1,cores=1,threads=1 \
> -uuid 88dbaf46-4692-4935-bd9d-8d8fac7725a9 -no-user-config -nodefaults \
> -chardev
> socket,id=charmonitor,path=/var/lib/libvirt/qemu/88dbaf46-4692-4935-bd9d-8d
> 8fac7725a9.monitor,server,nowait \
> -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime \
> -no-shutdown -boot strict=on -device
> piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
> -drive
> file=/mnt/store01/virt/88dbaf46-4692-4935-bd9d-8d8fac7725a9.qcow2,if=none,i
> d=drive-virtio-disk0,format=qcow2,cache=writeback \
> -device
> virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virt
> io-disk0,bootindex=1 \
> -drive if=none,id=drive-ide0-1-0,readonly=on,format=raw \
> -device
> ide-cd,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0,bootindex=2 \
> -netdev tap,fd=29,id=hostnet0 -device
> rtl8139,netdev=hostnet0,id=net0,mac=52:54:00:07:19:5e,bus=pci.0,addr=0x3 \
> -chardev pty,id=charserial0 -device
> isa-serial,chardev=charserial0,id=serial0 \
> -vnc 127.0.0.1:0,password -device VGA,id=video0,bus=pci.0,addr=0x2 \
> -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5 \
> -msg timestamp=on
>
> I posted another thread asking about migration failure due to a copy
> taking too long, but got no traction. In the case where the problem raises
> its head we have turned tunneling on. A tiny VM (<2GB in size) migrates
> fine using the same procedure. Again, no shared storage.
Is the guest that doesn't migrate idle or is it busily changing lots of memory?
> >>Q: why is max_size == 0? Does this seem correct?
> >
> >Yes, I think that's normal for the 1st time through the loop; (see
> >migration_thread
> >near the start max_size is initialised to 0).
>
> Thank you; will do.
>
> >>
> >>
> >> The patches appear to have been fully applied, but it would seem that
> >>the
> >> savevm_block_handlers structure needs to be updated to populate this
> >> field? Which implies that a new function will have to be written?
> >>
> >> Or, if I have missed the obvious, I would appreciate enlightenment.
> >
> >Simple bug on my part; the line:
> >
> > if (se->ops->can_postcopy(se->opaque)) {
> >
> >needs to become:
> > if (se->ops->can_postcopy &&
> > se->ops->can_postcopy(se->opaque)) {
>
> I wondered if that were not the case. I will make that change and see what
> happens.
>
> >Thanks for the report.
>
> Thank you for your time and ownership.
No problem; note the postcopy code is still quite young, so don't
be too surprised if you hit other issues.
Dave
>
> Gary
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: [Qemu-devel] Bug in recent postcopy patch
2014-10-30 20:08 ` Dr. David Alan Gilbert
@ 2014-10-30 21:59 ` Gary Hook
2014-10-31 12:04 ` Dr. David Alan Gilbert
0 siblings, 1 reply; 6+ messages in thread
From: Gary Hook @ 2014-10-30 21:59 UTC (permalink / raw)
To: qemu-devel@nongnu.org; +Cc: Dr. David Alan Gilbert
On 10/30/14, 3:08 PM, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>>I posted another thread asking about migration failure due to a copy
>> taking too long, but got no traction. In the case where the problem
>>raises
>> its head we have turned tunneling on. A tiny VM (<2GB in size) migrates
>> fine using the same procedure. Again, no shared storage.
>
>Is the guest that doesn't migrate idle or is it busily changing lots of
>memory?
Quite idle. Boot the VM, no need to start a workload, try to migrate.
Failure.
Also, very large VMs will fail to migrate (non-tunneled). This _seems_ to
also be related to the amount of time required to copy everything from A
to B.
Again, tunneling seems to more quickly expose this issue as it increases
the amount of time required to copy the qcow2 file over the network.
I will add here that I've watched the qcow2 file grow, made a copy of it
(on the receiving end) before it gets deleted, and been able to start a VM
using the file. It would seem to be copasetic.
I need to add tracing code to the emulator, in a way that doesn't rely
upon command line options or environment variables. I don't see any such
facility at this point. Specifically, I want to begin by watching what is
going through the monitor (i.e., return values from qemu-system-x86_64 and
why there are complaints). Unless you have any clear explanation as to why
the emulator is throwing an error, could you suggest any areas I may want
to focus my efforts?
>>
>> >Thanks for the report.
>>
>> Thank you for your time and ownership.
>
>No problem; note the postcopy code is still quite young, so don't
>be too surprised if you hit other issues.
Of course; it's fresh out of the oven. But the migration of VMs using
non-shared storage is not (tunneled or otherwise), and that's really what
I am focused on.
Again, much appreciation.
Gary
* Re: [Qemu-devel] Bug in recent postcopy patch
2014-10-30 21:59 ` Gary Hook
@ 2014-10-31 12:04 ` Dr. David Alan Gilbert
0 siblings, 0 replies; 6+ messages in thread
From: Dr. David Alan Gilbert @ 2014-10-31 12:04 UTC (permalink / raw)
To: Gary Hook; +Cc: qemu-devel@nongnu.org
* Gary Hook (gary.hook@nimboxx.com) wrote:
>
>
> On 10/30/14, 3:08 PM, "Dr. David Alan Gilbert" <dgilbert@redhat.com> wrote:
>
> >>I posted another thread asking about migration failure due to a copy
> >> taking too long, but got no traction. In the case where the problem
> >>raises
> >> its head we have turned tunneling on. A tiny VM (<2GB in size) migrates
> >> fine using the same procedure. Again, no shared storage.
> >
> >Is the guest that doesn't migrate idle or is it busily changing lots of
> >memory?
>
> Quite idle. Boot the VM, no need to start a workload, try to migrate.
> Failure.
>
> Also, very large VMs will fail to migrate (non-tunneled). This _seems_ to
> also be related to the amount of time required to copy everything from A
> to B.
>
> Again, tunneling seems to more quickly expose this issue as it increases
> the amount of time required to copy the qcow2 file over the network.
>
> I will add here that I've watched the qcow2 file grow, made a copy of it
> (on the receiving end) before it gets deleted, and been able to start a VM
> using the file. It would seem to be copasetic.
>
> I need to add tracing code to the emulator, in a way that doesn't rely
> upon command line options or environment variables. I don't see any such
> facility at this point. Specifically, I want to begin by watching what is
> going through the monitor (i.e., return values from qemu-system-x86_64 and
> why there are complaints). Unless you have any clear explanation as to why
> the emulator is throwing an error, could you suggest any areas I may want
> to focus my efforts?
No I don't, but there again I've not done any block stuff, and it sounds like
your problem is mostly related to moving the image file (which I thought
libvirt preferred to do using NBD underneath now, but again, I'm not a block
guy).
> >> >Thanks for the report.
> >>
> >> Thank you for your time and ownership.
> >
> >No problem; note the postcopy code is still quite young, so don't
> >be too surprised if you hit other issues.
>
> Of course; it's fresh out of the oven. But the migration of VMs using
> non-shared storage is not (tunneled or otherwise), and that's really what
> I am focused on.
>
> Again, much appreciation.
Dave
> Gary
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK