From: Peter Xu <peterx@redhat.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: qemu-devel@nongnu.org, Andrea Arcangeli <aarcange@redhat.com>,
"Daniel P . Berrange" <berrange@redhat.com>,
Juan Quintela <quintela@redhat.com>,
Alexey Perevalov <a.perevalov@samsung.com>
Subject: Re: [Qemu-devel] [PATCH v5 00/28] Migration: postcopy failure recovery
Date: Fri, 12 Jan 2018 17:27:28 +0800
Message-ID: <20180112092728.GP2551@xz-mi>
In-Reply-To: <20180111165930.GE2669@work-vm>

On Thu, Jan 11, 2018 at 04:59:32PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Tree is pushed here for better reference and testing (online tree
> > includes monitor OOB series):
> >
> > https://github.com/xzpeter/qemu/tree/postcopy-recover-all
> >
> > This version removed quite a few patches related to migrate-incoming,
> > instead I introduced a new command "migrate-recover" to trigger the
> > recovery channel on destination side to simplify the code.
>
> I've got this setup on a couple of my test hosts, and I'm using
> iptables to try breaking the connection.
>
> See below for where I got stuck.
>
> > To test this two series altogether, please checkout above tree and
> > build. Note: to test on small and single host, one need to disable
> > full bandwidth postcopy migration otherwise it'll complete very fast.
> > Basically a simple patch like this would help:
> >
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 4de3b551fe..c0206023d7 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1904,7 +1904,7 @@ static int postcopy_start(MigrationState *ms, bool *old_vm_running)
> > * will notice we're in POSTCOPY_ACTIVE and not actually
> > * wrap their state up here
> > */
> > - qemu_file_set_rate_limit(ms->to_dst_file, INT64_MAX);
> > + // qemu_file_set_rate_limit(ms->to_dst_file, INT64_MAX);
> > if (migrate_postcopy_ram()) {
> > /* Ping just for debugging, helps line traces up */
> > qemu_savevm_send_ping(ms->to_dst_file, 2);
> >
> > This patch is included already in above github tree. Please feel free
> > to drop this patch when want to test on big machines and between real
> > hosts.
> >
> > Detailed Test Procedures (QMP only)
> > ===================================
> >
> > 1. start source QEMU.
> >
> > $qemu -M q35,kernel-irqchip=split -enable-kvm -snapshot \
> > -smp 4 -m 1G -qmp stdio \
> > -name peter-vm,debug-threads=on \
> > -netdev user,id=net0 \
> > -device e1000,netdev=net0 \
> > -global migration.x-max-bandwidth=4096 \
> > -global migration.x-postcopy-ram=on \
> > /images/fedora-25.qcow2
> >
> > 2. start destination QEMU.
> >
> > $qemu -M q35,kernel-irqchip=split -enable-kvm -snapshot \
> > -smp 4 -m 1G -qmp stdio \
> > -name peter-vm,debug-threads=on \
> > -netdev user,id=net0 \
> > -device e1000,netdev=net0 \
> > -global migration.x-max-bandwidth=4096 \
> > -global migration.x-postcopy-ram=on \
> > -incoming tcp:0.0.0.0:5555 \
> > /images/fedora-25.qcow2
>
> I'm using:
> ./x86_64-softmmu/qemu-system-x86_64 -nographic -M pc,accel=kvm \
>     -smp 4 -m 16G \
>     -drive file=/home/vms/rhel71.qcow2,id=d,cache=none,if=none \
>     -device virtio-blk,drive=d -vnc 0:0 -incoming tcp:0:8888 \
>     -chardev socket,port=4000,host=0,id=mon,server,nowait,telnet \
>     -mon chardev=mon,id=mon,mode=control -nographic \
>     -chardev stdio,mux=on,id=monh -mon chardev=monh,mode=readline \
>     --device isa-serial,chardev=monh
> and I've got both the HMP on the stdio, and the QMP via a telnet
>
> >
> > 3. On source, do QMP handshake as normal:
> >
> > {"execute": "qmp_capabilities"}
> > {"return": {}}
> >
> > 4. On destination, do QMP handshake to enable OOB:
> >
> > {"execute": "qmp_capabilities", "arguments": { "enable": [ "oob" ] } }
> > {"return": {}}
> >
> > 5. On source, trigger initial migrate command, switch to postcopy:
> >
> > {"execute": "migrate", "arguments": { "uri": "tcp:localhost:5555" } }
> > {"return": {}}
> > {"execute": "query-migrate"}
> > {"return": {"expected-downtime": 300, "status": "active", ...}}
> > {"execute": "migrate-start-postcopy"}
> > {"return": {}}
> > {"timestamp": {"seconds": 1512454728, "microseconds": 768096}, "event": "STOP"}
> > {"execute": "query-migrate"}
> > {"return": {"expected-downtime": 44472, "status": "postcopy-active", ...}}
> >
> > 6. On source, manually trigger a "fake network down" using
> > "migrate-cancel" command:
> >
> > {"execute": "migrate_cancel"}
> > {"return": {}}
>
> Before I do that, I'm breaking the network connection by running on the
> source:
> iptables -A INPUT -p tcp --source-port 8888 -j DROP
> iptables -A INPUT -p tcp --destination-port 8888 -j DROP
This is tricky... I think TCP keepalive may help, but we certainly
need a way to cancel the migration on both sides. Please see my
comment below.
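
To illustrate what keepalive means at the socket level, here is a
minimal sketch in Python. It is not QEMU code: the helper name
enable_keepalive() and the timing values are assumptions chosen for
the example, and the TCP_KEEP* tunables are only set where the
platform exposes them.

```python
import socket


def enable_keepalive(sock, idle=10, interval=5, count=3):
    """Turn on TCP keepalive so a silently dropped peer is eventually noticed.

    The default timings are illustrative only, not values from QEMU.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The fine-grained tunables are platform specific, so set each one
    # only if this Python build exposes it.
    tunables = (
        ("TCP_KEEPIDLE", idle),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # unanswered probes before giving up
    )
    for name, value in tunables:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    enable_keepalive(s)
    print("keepalive on:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

With settings like these, an idle connection whose packets are being
dropped would be reported as failed after roughly idle + interval *
count seconds, which would let the destination notice the outage
without being told by the source.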
>
> > During postcopy, it'll not really cancel the migration, but pause
> > it. On both sides, we should see this on stderr:
> >
> > qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
> >
> > It means now both sides are in postcopy-pause state.
>
> Now, here we start to have a problem; I do the migrate-cancel on the
> source, that works and goes into pause; but remember the network is
> broken, so the destination hasn't received the news.
>
> > 7. (Optional) On destination side, let's try to hang the main thread
> > using the new x-oob-test command, providing a "lock=true" param:
> >
> > {"execute": "x-oob-test", "id": "lock-dispatcher-cmd",
> > "arguments": { "lock": true } }
> >
> > After sending this command, we should not see any "return", because
> > main thread is blocked already. But we can still use the monitor
> > since the monitor now has dedicated IOThread.
> >
> > 8. On destination side, provide a new incoming port using the new
> > command "migrate-recover" (note that if step 7 is carried out, we
> > _must_ use OOB form, otherwise the command will hang. With OOB,
> > this command will return immediately):
> >
> > {"execute": "migrate-recover", "id": "recover-cmd",
> > "arguments": { "uri": "tcp:localhost:5556" },
> > "control": { "run-oob": true } }
> > {"timestamp": {"seconds": 1512454976, "microseconds": 186053},
> > "event": "MIGRATION", "data": {"status": "setup"}}
> > {"return": {}, "id": "recover-cmd"}
> >
> > We can see that the command will success even if main thread is
> > locked up.
>
> Because the destination didn't get the news of the pause, I get:
> {"id": "recover-cmd", "error": {"class": "GenericError", "desc": "Migrate recover can only be run when postcopy is paused."}}
This is expected, since the destination never saw a failure, while...
>
> and I can't explicitly cause a cancel on the destination:
> {"id": "cancel-cmd", "error": {"class": "GenericError", "desc": "The command migrate_cancel does not support OOB"}}
... this is not normal. I have two questions:

1. Did you provide the
   "control": {"run-oob": true}
   field when sending the "migrate_cancel" command? Note that
   migrate_cancel should not be run in the OOB way. If you did not
   provide that field, this may be a monitor-oob bug.
2. Do we need to support "migrate_cancel" on the destination?

For (2), I think we do, but for now it only works on the source. So I
think I should add that support.
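
To make question (1) concrete, here is a small Python sketch that
builds the same command with and without the "control" field. The
helper build_qmp_command() is hypothetical; the JSON layout follows
the examples quoted in this thread for the OOB series under test, and
is not a statement about any other QEMU version.

```python
import json


def build_qmp_command(name, arguments=None, cmd_id=None, run_oob=False):
    """Return a QMP command line as a JSON string.

    Hypothetical helper; the field names mirror the examples in this thread.
    """
    cmd = {"execute": name}
    if cmd_id is not None:
        cmd["id"] = cmd_id
    if arguments:
        cmd["arguments"] = arguments
    if run_oob:
        # Only commands that support OOB should carry this field.
        cmd["control"] = {"run-oob": True}
    return json.dumps(cmd)


if __name__ == "__main__":
    # In-band form, which is how migrate_cancel is meant to be sent.
    print(build_qmp_command("migrate_cancel", cmd_id="cancel-cmd"))
    # OOB form, as used for migrate-recover in step 8.
    print(build_qmp_command("migrate-recover",
                            {"uri": "tcp:localhost:5556"},
                            cmd_id="recover-cmd", run_oob=True))
```

If the first form still produces the "does not support OOB" error,
that would point at the monitor rather than at the request.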
>
> So I think we need a way out of this on the destination.
So that's my second question. How about this: migrate_cancel will
cancel the incoming migration if:

a. there is an incoming migration in progress, and
b. postcopy is enabled
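
The proposed rule can be sketched as a simple predicate. This is
Python for illustration only, not the eventual C implementation, and
the function and parameter names are made up for the example.

```python
def can_cancel_incoming(incoming_in_progress, postcopy_enabled):
    """Proposed rule: cancel on the destination only if both (a) and (b) hold.

    Hypothetical names; this models the proposal, not existing QEMU behaviour.
    """
    return bool(incoming_in_progress and postcopy_enabled)


if __name__ == "__main__":
    for incoming in (False, True):
        for postcopy in (False, True):
            print(incoming, postcopy, can_cancel_incoming(incoming, postcopy))
```

Only the case where both conditions are true would let migrate_cancel
act on the destination; every other combination keeps today's
behaviour.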
Thanks,
--
Peter Xu
Thread overview: 43+ messages
2017-12-05 6:52 [Qemu-devel] [PATCH v5 00/28] Migration: postcopy failure recovery Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 01/28] migration: better error handling with QEMUFile Peter Xu
2017-12-05 11:40 ` Dr. David Alan Gilbert
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 02/28] migration: reuse mis->userfault_quit_fd Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 03/28] migration: provide postcopy_fault_thread_notify() Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 04/28] migration: new postcopy-pause state Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 05/28] migration: implement "postcopy-pause" src logic Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 06/28] migration: allow dst vm pause on postcopy Peter Xu
2017-12-14 13:10 ` Dr. David Alan Gilbert
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 07/28] migration: allow src return path to pause Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 08/28] migration: allow send_rq to fail Peter Xu
2017-12-14 13:21 ` Dr. David Alan Gilbert
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 09/28] migration: allow fault thread to pause Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 10/28] qmp: hmp: add migrate "resume" option Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 11/28] migration: pass MigrationState to migrate_init() Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 12/28] migration: rebuild channel on source Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 13/28] migration: new state "postcopy-recover" Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 14/28] migration: wakeup dst ram-load-thread for recover Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 15/28] migration: new cmd MIG_CMD_RECV_BITMAP Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 16/28] migration: new message MIG_RP_MSG_RECV_BITMAP Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 17/28] migration: new cmd MIG_CMD_POSTCOPY_RESUME Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 18/28] migration: new message MIG_RP_MSG_RESUME_ACK Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 19/28] migration: introduce SaveVMHandlers.resume_prepare Peter Xu
2017-12-05 6:52 ` [Qemu-devel] [PATCH v5 20/28] migration: synchronize dirty bitmap for resume Peter Xu
2017-12-05 6:53 ` [Qemu-devel] [PATCH v5 21/28] migration: setup ramstate " Peter Xu
2017-12-05 6:53 ` [Qemu-devel] [PATCH v5 22/28] migration: final handshake for the resume Peter Xu
2017-12-05 6:53 ` [Qemu-devel] [PATCH v5 23/28] migration: free SocketAddress where allocated Peter Xu
2017-12-05 6:53 ` [Qemu-devel] [PATCH v5 24/28] migration: init dst in migration_object_init too Peter Xu
2017-12-05 6:53 ` [Qemu-devel] [PATCH v5 25/28] io: let watcher of the channel run in same ctx Peter Xu
2017-12-05 6:53 ` [Qemu-devel] [PATCH v5 26/28] migration: allow migrate_cancel to pause postcopy Peter Xu
2017-12-19 10:58 ` Dr. David Alan Gilbert
2018-01-24 8:28 ` Peter Xu
2018-01-24 9:06 ` Dr. David Alan Gilbert
2017-12-05 6:53 ` [Qemu-devel] [PATCH v5 27/28] qmp/migration: new command migrate-recover Peter Xu
2017-12-05 6:53 ` [Qemu-devel] [PATCH v5 28/28] hmp/migration: add migrate_recover command Peter Xu
2017-12-05 6:55 ` [Qemu-devel] [PATCH v5 00/28] Migration: postcopy failure recovery Peter Xu
2017-12-05 18:43 ` Dr. David Alan Gilbert
2017-12-06 2:39 ` Peter Xu
2018-01-11 16:59 ` Dr. David Alan Gilbert
2018-01-12 9:27 ` Peter Xu [this message]
2018-01-12 12:27 ` Dr. David Alan Gilbert
2018-01-24 6:19 ` Peter Xu
2018-01-24 9:05 ` Dr. David Alan Gilbert