From: Fei Li <fli@suse.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: qemu-devel@nongnu.org, quintela@redhat.com, peterx@redhat.com
Subject: Re: [Qemu-devel] [PATCH RFC 0/2] Fix migration issues
Date: Fri, 26 Oct 2018 20:59:26 +0800
Message-ID: <670ce3cc-2146-ad2b-197e-8b18daf49e79@suse.com>
In-Reply-To: <20181025125501.GA5912@work-vm>



On 10/25/2018 08:55 PM, Dr. David Alan Gilbert wrote:
> * Fei Li (fli@suse.com) wrote:
>> Hi,
>> these two patches are to fix live migration issues. The first is
>> about multifd, and the second is to fix some error handling.
>>
>> But I have a question about using multifd migration.
>> In our current code, when multifd is used during migration, if there
>> is an error before the destination receives all new channels (I mean
>> multifd_recv_new_channel(ioc)), the destination does not exit but
>> keeps waiting (Hang in recvmsg() in qio_channel_socket_readv) until
>> the source exits.
>>
>> My question is about the state of the destination host if it fails during
>> this period. I did a test: after applying the [1/2] patch, if
>> multifd_new_send_channel_async() fails, the destination host hangs for
>> a while then later pops up a window saying
>>      "'QEMU (...) [stopped]' is not responding.
>>      You may choose to wait a short while for it to continue or force
>>      the application to quit entirely."
>> But after closing the window by clicking, the qemu on the dest still
>> hangs there until I explicitly kill the qemu on the source.
> That sounds like the main thread is blocked for some reason?
Yes, the main thread on the dst keeps looping.
> But I don't
> normally use the window setup;  if you try with -nographic and can see
> the HMP (or a QMP) monitor, can you see if the monitor still responds?

Thanks for the `-nographic` reminder. I found an interesting
phenomenon:
If I run `migrate -d tcp:ip_addr:port` before the guest's display appears
(while the screen is still dark), there is no hang and the guest starts up
properly later. But if I start the live migration after the guest has fully
booted, I mean when I can use the mouse inside the guest, the hang
occurs.
This is true when using `-nographic` for both src and dst,
and when using `-nographic` for only the src or only the dst.


The symptom of the hang is that the dst never seems to respond (I
waited three minutes), and the cursor just keeps blinking. After I
explicitly kill the src, the dst quits, as follows:
(Same result if gdb is not used on the src)
src:
(qemu) ...
(qemu) q
(gdb) q
dst:
(qemu) Up to now, dst has received the 0 channel
Up to now, dst has received the 1 channel

(qemu)
(qemu)

To check the migration state on the src:
(qemu) info migrate
globals:
store-global-state: on
only-migratable: off
send-configuration: on
send-section-footer: on
decompress-error-check: on
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off 
zero-blocks: off compress: off events: off postcopy-ram: off x-colo: off 
release-ram: off block: off return-path: off pause-before-switchover: 
off x-multifd: on dirty-bitmaps: off postcopy-blocktime: off 
late-block-activate: off
Migration status: setup /* I added some code to set the status to
"failed", but it still does not work; see below for details */
total time: 0 milliseconds

I guess the source should proactively notify the dst and disconnect
from its side, so I tried to set the above "Migration status" to
"failed" and to call qemu_fclose(s->to_dst_file) when
multifd_new_send_channel_async() fails.
(BTW: I even tried:
  if (s->vm_was_running) {   vm_start();   }   )
But the hang is still there.
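
To make the "source disconnects proactively" idea concrete, here is a
minimal sketch using plain POSIX sockets rather than QEMU code. It only
shows the general socket behaviour: a blocking recv() returns 0 (EOF) as
soon as the peer of that same socket is shut down. The helper name
peer_close_unblocks_read() is made up for illustration, and socketpair()
stands in for the TCP connection. Whether the channel the dst blocks on
is the one closed by qemu_fclose(s->to_dst_file) is not established
here; if it is a different (multifd) channel, closing the main channel
alone would not be expected to wake the reader.

```c
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a QEMU function): create a connected socket
 * pair, shut down the "source" end without sending anything, then do a
 * blocking recv() on the "destination" end.
 * Returns the recv() result: 0 means EOF, -1 means error.
 */
static ssize_t peer_close_unblocks_read(void)
{
    int fds[2];
    char buf[25];   /* same size as the buflen=25 read in the backtrace */
    ssize_t ret;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
        return -1;
    }
    assert(fds[0] >= 0 && fds[1] >= 0);

    /* "Source" side gives up before sending the initial packet. */
    shutdown(fds[0], SHUT_RDWR);
    close(fds[0]);

    /* "Destination" side: recv() sees EOF instead of blocking. */
    ret = recv(fds[1], buf, sizeof(buf), 0);

    close(fds[1]);
    return ret;
}
```

Calling peer_close_unblocks_read() returns 0, i.e. the reader is woken
by EOF on the socket whose peer went away.
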
> If it doesn't then try and get a backtrace.
>
> The monitor really shouldn't block, so it would be interesting to see.
>
> Dave
I set two breakpoints and got the following backtrace; I hope it
helps. :)

Thread 1 "qemu-system-x86" hit Breakpoint 1, multifd_recv_new_channel (
     ioc=0x555557995af0) at /build/gitcode/qemu-build/migration/ram.c:1368
1368    {
(gdb) c
Continuing.

Thread 1 "qemu-system-x86" hit Breakpoint 2, qio_channel_socket_readv (
     ioc=0x555557995af0, iov=0x5555568777d0, niov=1, fds=0x0, nfds=0x0,
     errp=0x7fffffffdb38) at io/channel-socket.c:463
463    {
(gdb) n
464        QIOChannelSocket *sioc = QIO_CHANNEL_SOCKET(ioc);
(gdb)
......
483     retry:
(gdb)
484        ret = recvmsg(sioc->fd, &msg, sflags);
(gdb) bt
#0  qio_channel_socket_readv (ioc=0x555557995af0, iov=0x5555568777d0, niov=1,
     fds=0x0, nfds=0x0, errp=0x7fffffffdb38) at io/channel-socket.c:484
#1  0x0000555555d156c5 in qio_channel_readv_full (ioc=0x555557995af0,
     iov=0x5555568777d0, niov=1, fds=0x0, nfds=0x0, errp=0x7fffffffdb38)
     at io/channel.c:65
#2  0x0000555555d15b26 in qio_channel_readv (ioc=0x555557995af0,
     iov=0x5555568777d0, niov=1, errp=0x7fffffffdb38) at io/channel.c:197
#3  0x0000555555d15853 in qio_channel_readv_all_eof (ioc=0x555557995af0,
     iov=0x7fffffffda70, niov=1, errp=0x7fffffffdb38) at io/channel.c:106
#4  0x0000555555d1595c in qio_channel_readv_all (ioc=0x555557995af0,
     iov=0x7fffffffda70, niov=1, errp=0x7fffffffdb38) at io/channel.c:142
#5  0x0000555555d15d0c in qio_channel_read_all (ioc=0x555557995af0,
     buf=0x7fffffffdad0 "\340\"zVUU", buflen=25, errp=0x7fffffffdb38)
     at io/channel.c:246
#6  0x000055555587695c in multifd_recv_initial_packet (c=0x555557995af0,
     errp=0x7fffffffdb38) at /build/gitcode/qemu-build/migration/ram.c:653
#7  0x00005555558788fb in multifd_recv_new_channel (ioc=0x555557995af0)
     at /build/gitcode/qemu-build/migration/ram.c:1374
#8  0x0000555555bc9978 in migration_ioc_process_incoming (ioc=0x555557995af0)
     at migration/migration.c:573
#9  0x0000555555bd0c69 in migration_channel_process_incoming (ioc=0x555557995af0)
     at migration/channel.c:47
#10 0x0000555555bcf7e8 in socket_accept_incoming_migration (
     listener=0x5555578dcae0, cioc=0x555557995af0, opaque=0x0)
     at migration/socket.c:166
#11 0x0000555555d2051f in qio_net_listener_channel_func (ioc=0x5555579c7180,
     condition=G_IO_IN, opaque=0x5555578dcae0) at io/net-listener.c:53
#12 0x0000555555d1c0a2 in qio_channel_fd_source_dispatch (source=0x5555568d5970,
     callback=0x555555d20473 <qio_net_listener_channel_func>,
     user_data=0x5555578dcae0) at io/channel-watch.c:84
#13 0x00007ffff6353dc5 in g_main_context_dispatch ()
    from /usr/lib64/libglib-2.0.so.0
#14 0x0000555555d7d1ad in glib_pollfds_poll () at util/main-loop.c:215
#15 0x0000555555d7d227 in os_host_main_loop_wait (timeout=0)
     at util/main-loop.c:238
#16 0x0000555555d7d2e0 in main_loop_wait (nonblocking=0)
     at util/main-loop.c:497
#17 0x00005555559cd679 in main_loop () at vl.c:1884
#18 0x00005555559d4f1e in main (argc=32, argv=0x7fffffffe0b8, envp=0x7fffffffe1c0)
     at vl.c:4618
(gdb) n

Thread 1 "qemu-system-x86" received signal SIGINT, Interrupt.
0x00007ffff5606f64 in recvmsg () from /lib64/libpthread.so.0
(gdb) c
Continuing.

After I enter the `n` above, the dst just hangs here; it seems to be waiting
for recvmsg(sioc->fd, &msg, sflags) to return. Even when I later try to kill
it with ctrl+c, the dst still hangs.
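
The backtrace shows the main loop thread sitting in a blocking recvmsg()
while it waits for the 25-byte initial packet. As an illustration of how
such a read can be bounded, here is a minimal sketch that waits with
poll() for a limited time before calling recv(). It assumes plain POSIX
sockets instead of QEMU's QIOChannel layer; recv_with_timeout() and
demo_silent_peer() are made-up names, and the 100 ms timeout is
arbitrary. It is a sketch of the general technique, not a proposed
patch.

```c
#include <assert.h>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not a QEMU function): wait at most timeout_ms for
 * data on fd before calling recv(), so a silent peer cannot block the
 * caller forever.
 * Returns bytes read, 0 on EOF, -1 on error, -2 on timeout.
 */
static ssize_t recv_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    assert(buf != NULL && len > 0);

    rc = poll(&pfd, 1, timeout_ms);
    if (rc < 0) {
        return -1;      /* poll() itself failed */
    }
    if (rc == 0) {
        return -2;      /* nothing arrived in time */
    }
    return recv(fd, buf, len, 0);
}

/* Demo: the peer stays connected but never sends anything. */
static ssize_t demo_silent_peer(void)
{
    int fds[2];
    char buf[25];
    ssize_t ret;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) < 0) {
        return -1;
    }
    ret = recv_with_timeout(fds[1], buf, sizeof(buf), 100);
    close(fds[0]);
    close(fds[1]);
    return ret;
}
```

With a connected but silent peer, demo_silent_peer() returns -2 after
the timeout instead of hanging, so the caller gets a chance to report an
error and return to its main loop.
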

Have a nice day, thanks
Fei
>
>> The source host keeps running as expected, but I guess the hang
>> phenomenon in the dest is not right.
>> Would someone kindly give some suggestions on this? Thanks a lot.
>>
>>
>> Fei Li (2):
>>    migration: fix the multifd code
>>    migration: fix some error handling
>>
>>   migration/migration.c    |  5 +----
>>   migration/postcopy-ram.c |  3 +++
>>   migration/ram.c          | 33 +++++++++++++++++++++++----------
>>   migration/ram.h          |  2 +-
>>   4 files changed, 28 insertions(+), 15 deletions(-)
>>
>> -- 
>> 2.13.7
>>
> --
> Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
>
