From: Peter Xu <peterx@redhat.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: qemu-devel@nongnu.org, Andrea Arcangeli <aarcange@redhat.com>,
	"Daniel P . Berrange" <berrange@redhat.com>,
	Juan Quintela <quintela@redhat.com>,
	Alexey Perevalov <a.perevalov@samsung.com>
Subject: Re: [Qemu-devel] [PATCH v5 00/28] Migration: postcopy failure recovery
Date: Wed, 6 Dec 2017 10:39:45 +0800
Message-ID: <20171206023945.GC2797@xz-mi>
In-Reply-To: <20171205184341.GF2405@work-vm>

On Tue, Dec 05, 2017 at 06:43:42PM +0000, Dr. David Alan Gilbert wrote:
> * Peter Xu (peterx@redhat.com) wrote:
> > Tree is pushed here for better reference and testing (online tree
> > includes monitor OOB series):
> > 
> >   https://github.com/xzpeter/qemu/tree/postcopy-recover-all
> > 
> > This version removed quite a few patches related to migrate-incoming,
> > instead I introduced a new command "migrate-recover" to trigger the
> > recovery channel on destination side to simplify the code.
> > 
> > To test these two series together, please check out the above tree and
> > build it.  Note: to test on a small, single host, one needs to disable
> > full-bandwidth postcopy migration, otherwise it'll complete very fast.
> > Basically a simple patch like this would help:
> > 
> > diff --git a/migration/migration.c b/migration/migration.c
> > index 4de3b551fe..c0206023d7 100644
> > --- a/migration/migration.c
> > +++ b/migration/migration.c
> > @@ -1904,7 +1904,7 @@ static int postcopy_start(MigrationState *ms, bool *old_vm_running)
> >       * will notice we're in POSTCOPY_ACTIVE and not actually
> >       * wrap their state up here
> >       */
> > -    qemu_file_set_rate_limit(ms->to_dst_file, INT64_MAX);
> > +    // qemu_file_set_rate_limit(ms->to_dst_file, INT64_MAX);
> >      if (migrate_postcopy_ram()) {
> >          /* Ping just for debugging, helps line traces up */
> >          qemu_savevm_send_ping(ms->to_dst_file, 2);
> > 
> > This patch is already included in the above github tree.  Please feel
> > free to drop this patch when you want to test on big machines and
> > between real hosts.
> > 
> > Detailed Test Procedures (QMP only)
> > ===================================
> > 
> > 1. start source QEMU.
> > 
> > $qemu -M q35,kernel-irqchip=split -enable-kvm -snapshot \
> >      -smp 4 -m 1G -qmp stdio \
> >      -name peter-vm,debug-threads=on \
> >      -netdev user,id=net0 \
> >      -device e1000,netdev=net0 \
> >      -global migration.x-max-bandwidth=4096 \
> >      -global migration.x-postcopy-ram=on \
> >      /images/fedora-25.qcow2
> 
> I suspect -snapshot isn't doing the right thing to the storage when
> combined with the migration - I'm assuming the destination isn't using
> the same temporary file.
> (Also any reason for specifying split irqchip?)

Ah yes.  Sorry, we should not use "-snapshot" here.  Please remove it.

I think my smoke test just didn't try to fetch anything on that temp
storage so nothing went wrong.

And there is no reason for the split irqchip - I just took this command
line from somewhere I was testing IOMMUs. :-) Please feel free to
remove it too if you want.

(So basically I was just pasting my smoke test command lines, not
 the command lines actually required to run the tests.)
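
For reference, here is a minimal sketch of what the two command lines look like with "-snapshot" and "kernel-irqchip=split" dropped. It only reassembles the options quoted in this thread. The binary name qemu-system-x86_64 is taken from the stderr line in step 6, and the helper qemu_argv() is hypothetical, not part of the series:

```python
import shlex

def qemu_argv(incoming=None, image="/images/fedora-25.qcow2"):
    """Build the test command line without -snapshot and without
    kernel-irqchip=split (hypothetical helper, for illustration only)."""
    argv = [
        "qemu-system-x86_64",
        "-M", "q35", "-enable-kvm",
        "-smp", "4", "-m", "1G", "-qmp", "stdio",
        "-name", "peter-vm,debug-threads=on",
        "-netdev", "user,id=net0",
        "-device", "e1000,netdev=net0",
        "-global", "migration.x-max-bandwidth=4096",
        "-global", "migration.x-postcopy-ram=on",
    ]
    # Only the destination listens for the incoming migration stream.
    if incoming is not None:
        argv += ["-incoming", incoming]
    argv.append(image)
    return argv

src = qemu_argv()
dst = qemu_argv(incoming="tcp:0.0.0.0:5555")

# Print shell-quoted command lines that can be pasted into a terminal.
print(shlex.join(src))
print(shlex.join(dst))
```

The two lists differ only in the "-incoming" option, which matches steps 1 and 2 of the procedure.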

> 
> > 2. start destination QEMU.
> > 
> > $qemu -M q35,kernel-irqchip=split -enable-kvm -snapshot \
> >      -smp 4 -m 1G -qmp stdio \
> >      -name peter-vm,debug-threads=on \
> >      -netdev user,id=net0 \
> >      -device e1000,netdev=net0 \
> >      -global migration.x-max-bandwidth=4096 \
> >      -global migration.x-postcopy-ram=on \
> >      -incoming tcp:0.0.0.0:5555 \
> >      /images/fedora-25.qcow2
> > 
> > 3. On source, do QMP handshake as normal:
> > 
> >   {"execute": "qmp_capabilities"}
> >   {"return": {}}
> > 
> > 4. On destination, do QMP handshake to enable OOB:
> > 
> >   {"execute": "qmp_capabilities", "arguments": { "enable": [ "oob" ] } }
> >   {"return": {}}
> > 
> > 5. On source, trigger initial migrate command, switch to postcopy:
> > 
> >   {"execute": "migrate", "arguments": { "uri": "tcp:localhost:5555" } }
> >   {"return": {}}
> >   {"execute": "query-migrate"}
> >   {"return": {"expected-downtime": 300, "status": "active", ...}}
> >   {"execute": "migrate-start-postcopy"}
> >   {"return": {}}
> >   {"timestamp": {"seconds": 1512454728, "microseconds": 768096}, "event": "STOP"}
> >   {"execute": "query-migrate"}
> >   {"return": {"expected-downtime": 44472, "status": "postcopy-active", ...}}
> > 
> > 6. On source, manually trigger a "fake network down" using the
> >    "migrate_cancel" command:
> > 
> >   {"execute": "migrate_cancel"}
> >   {"return": {}}
> > 
> >   During postcopy, it'll not really cancel the migration, but pause
> >   it.  On both sides, we should see this on stderr:
> > 
> >   qemu-system-x86_64: Detected IO failure for postcopy. Migration paused.
> > 
> >   It means now both sides are in postcopy-pause state.
> > 
> > 7. (Optional) On destination side, let's try to hang the main thread
> >    using the new x-oob-test command, providing a "lock=true" param:
> > 
> >    {"execute": "x-oob-test", "id": "lock-dispatcher-cmd",
> >     "arguments": { "lock": true } }
> > 
> >    After sending this command, we should not see any "return", because
> >    the main thread is already blocked.  But we can still use the monitor
> >    since the monitor now has a dedicated IOThread.
> > 
> > 8. On destination side, provide a new incoming port using the new
> >    command "migrate-recover" (note that if step 7 is carried out, we
> >    _must_ use OOB form, otherwise the command will hang.  With OOB,
> >    this command will return immediately):
> > 
> >   {"execute": "migrate-recover", "id": "recover-cmd",
> >    "arguments": { "uri": "tcp:localhost:5556" },
> >    "control": { "run-oob": true } }
> >   {"timestamp": {"seconds": 1512454976, "microseconds": 186053},
> >    "event": "MIGRATION", "data": {"status": "setup"}}
> >   {"return": {}, "id": "recover-cmd"}
> > 
> >    We can see that the command will succeed even if the main thread is
> >    locked up.
> > 
> > 9. (Optional) This step is only needed if step 7 is carried out. On
> >    destination, let's unlock the main thread before resuming the
> >    migration, this time with "lock=false" to unlock the main thread
> >    (since system running needs the main thread). Note that we _must_
> >    use OOB command here too:
> > 
> >   {"execute": "x-oob-test", "id": "unlock-dispatcher",
> >    "arguments": { "lock": false }, "control": { "run-oob": true } }
> >   {"return": {}, "id": "unlock-dispatcher"}
> >   {"return": {}, "id": "lock-dispatcher-cmd"}
> > 
> >   Here the first "return" is the reply to the unlock command, the
> >   second "return" is the reply to the lock command.  After this
> >   command, main thread is released.
> > 
> > 10. On source, resume the postcopy migration:
> > 
> >   {"execute": "migrate", "arguments": { "uri": "tcp:localhost:5556", "resume": true }}
> >   {"return": {}}
> >   {"execute": "query-migrate"}
> >   {"return": {"status": "completed", ...}}
> 
> The use of x-oob-test to lock things is a bit different to reality
> and that means the ordering is different.
> When the destination is blocked by a page request, that page won't
> become unstuck until sometime after (10) happens and delivers the page
> to the target.
> 
> You could try an 'info cpus' on the destination at (7) - although it's
> not guaranteed to lock, depending on whether the page needed has arrived.

Yes, "info cpus" (or "query-cpus" in QMP) would work too.  The
"return" will be delayed until the resume command is sent, but it's
the same thing - here I just want to make sure the main thread is
completely hung, so I can know whether the new accept() port and the
whole workflow will still work in that case.
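
To make that variant concrete, here is a rough sketch of the QMP messages it would involve, with query-cpus standing in for x-oob-test at step 7. The message shapes ("id", "arguments", and "control": {"run-oob": true}) follow the ones quoted in the procedure above. The helper qmp_cmd() and the id "maybe-blocked" are made up for illustration, and whether query-cpus really blocks depends on which pages have already arrived:

```python
import json

def qmp_cmd(name, arguments=None, cmd_id=None, oob=False):
    """Serialize one QMP command; oob=True adds the run-oob control field
    in the form used by this series (hypothetical helper)."""
    msg = {"execute": name}
    if cmd_id is not None:
        msg["id"] = cmd_id
    if arguments is not None:
        msg["arguments"] = arguments
    if oob:
        msg["control"] = {"run-oob": True}
    return json.dumps(msg)

# Destination: enable OOB, send an in-band command that may block on a
# missing page, then open the recovery port out-of-band.
dst_script = [
    qmp_cmd("qmp_capabilities", {"enable": ["oob"]}),
    qmp_cmd("query-cpus", cmd_id="maybe-blocked"),
    qmp_cmd("migrate-recover", {"uri": "tcp:localhost:5556"},
            cmd_id="recover-cmd", oob=True),
]

# Source: resume the paused postcopy migration on the new port.
src_script = [
    qmp_cmd("migrate", {"uri": "tcp:localhost:5556", "resume": True}),
]

for line in dst_script + src_script:
    print(line)
```

In this variant there is no separate unlock step: the reply to "maybe-blocked" would only show up once the source has resumed and the needed page has been delivered.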

Btw, IMHO "info cpus" should guarantee a block.  If not, we can just
do something in the guest to make sure the guest hangs; then at least
one vcpu must be waiting for a page.  Thanks!

-- 
Peter Xu
