From: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
To: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
Cc: GILR@il.ibm.com, SADEKJ@il.ibm.com, quintela@redhat.com,
	BIRAN@il.ibm.com, hinesmr@cn.ibm.com, qemu-devel@nongnu.org,
	EREZH@il.ibm.com, owasserm@redhat.com, onom@us.ibm.com,
	junqing.wang@cs2c.com.cn, lig.fnst@cn.fujitsu.com,
	gokul@us.ibm.com, dbulkow@gmail.com, pbonzini@redhat.com,
	abali@us.ibm.com, isaku.yamahata@gmail.com,
	"Michael R. Hines" <mrhines@us.ibm.com>
Subject: Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
Date: Wed, 19 Feb 2014 09:40:07 +0800
Message-ID: <53040B77.3030008@linux.vnet.ibm.com>
In-Reply-To: <20140218124550.GF2662@work-vm>

On 02/18/2014 08:45 PM, Dr. David Alan Gilbert wrote:
>> +The Micro-Checkpointing Process
>> +Basic Algorithm
>> +Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a "live migration that never ends". As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps:
>> +
>> +1. After N milliseconds, stop the VM.
>> +3. Generate a MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
>> +4. Resume the VM immediately so that it can make forward progress.
>> +5. Transmit the checkpoint to the destination.
>> +6. Repeat
>> +Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally.
> Later you talk about the memory allocation and how you grow the memory as needed
> to fit the checkpoint, have you tried going the other way and triggering the
> checkpoints sooner if they're taking too much memory?

There is a 'knob' in this patch called "mc-set-delay" which was designed
to solve exactly that problem. It allows policy or management software
to make an independent decision about what the frequency of the
checkpoints should be.

I wasn't comfortable implementing policy directly inside the patch, as
that seemed likely to delay its acceptance by the community.
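
As an illustration of the kind of decision such external policy software
could make, here is a minimal sketch. It is not part of the patch: the
function name, the memory budget, the clamping bounds, and the choice of
milliseconds as the unit are all assumptions, and the step that would
actually pass the result to "mc-set-delay" is deliberately left out.

```python
def next_checkpoint_delay_ms(current_delay_ms, checkpoint_bytes, budget_bytes,
                             min_delay_ms=5, max_delay_ms=1000):
    """Hypothetical policy: shrink the delay when checkpoints exceed a budget.

    The delay is scaled by budget/size, so a checkpoint twice as large as
    the budget halves the delay, and a small checkpoint lets it grow again.
    """
    if checkpoint_bytes <= 0:
        # Nothing was dirtied; back off to the slowest allowed rate.
        return max_delay_ms
    scaled = current_delay_ms * budget_bytes / checkpoint_bytes
    # Clamp so the VM is neither stopped constantly nor left unprotected.
    return int(max(min_delay_ms, min(max_delay_ms, scaled)))
```

Management software would call something like this after each checkpoint
and apply the returned value through whatever interface it uses to drive
the knob.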

>> +1. MC over TCP/IP: Once the socket connection breaks, we assume
>> failure. This happens very early in the loss of the latest MC not only
>> because a very large amount of bytes is typically being sequenced in a
>> TCP stream but perhaps also because of the timeout in acknowledgement
>> of the receipt of a commit message by the destination.
>> +
>> +2. MC over RDMA: Since Infiniband does not provide any underlying
>> timeout mechanisms, this implementation enhances QEMU's RDMA migration
>> protocol to include a simple keep-alive. Upon the loss of multiple
>> keep-alive messages, the sender is deemed to have failed.
>> +
>> +In both cases, either due to a failed TCP socket connection or lost RDMA keep-alive group, both the sender or the receiver can be deemed to have failed.
>> +
>> +If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint.
>> +
>> +If the destination is deemed to be lost, we perform the same action
>> as a live migration: resume the sender normally and wait for management
>> software to make a policy decision about whether or not to re-protect
>> the VM, which may involve a third-party to identify a new destination
>> host again to use as a backup for the VM.
> In this world what is making the decision about whether the sender/destination
> should win - how do you avoid a split brain situation where both
> VMs are running but the only thing that failed is the comms between them?
> Is there any guarantee that you'll have received knowledge of the comms
> failure before you pull the plug out and enable the corked packets to be
> sent on the sender side?

Good question in general - I'll add it to the FAQ. The patch implements
a basic 'transaction' mechanism in coordination with an outbound I/O
buffer (documented further down). With these two things in place,
split-brain is not possible because the destination is not running.
We don't allow the destination to resume execution until a committed
transaction has been acknowledged by the destination, and only then
do we allow any outbound network traffic to be released to the
outside world.
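
The buffering half of that idea can be modelled in a few lines. This is
a toy sketch under stated assumptions, not the code in the patch: the
class, its method names, and the integer "epoch" used to tie packets to
a checkpoint are all hypothetical.

```python
from collections import deque


class OutboundBuffer:
    """Toy model: hold guest packets until their checkpoint is acknowledged."""

    def __init__(self):
        self.pending = deque()  # (epoch, packet) pairs not yet visible outside
        self.epoch = 0          # id of the checkpoint currently being filled

    def send(self, packet):
        # Guest output produced during the current epoch is corked, not sent.
        self.pending.append((self.epoch, packet))

    def checkpoint(self):
        # Close the current epoch and return its id so it can be acked later.
        closed = self.epoch
        self.epoch += 1
        return closed

    def ack(self, acked_epoch):
        # The destination acknowledged the commit: release every packet that
        # belongs to that epoch or an earlier one, in the original order.
        released = []
        while self.pending and self.pending[0][0] <= acked_epoch:
            released.append(self.pending.popleft()[1])
        return released
```

In this model, a packet sent after a checkpoint was taken stays corked
until the next checkpoint is itself acknowledged, so the outside world
never sees output from state the destination does not hold.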

> <snip>
>
>> +RDMA is used for two different reasons:
>> +
>> +1. Checkpoint generation (RDMA-based memcpy):
>> +2. Checkpoint transmission
>> +Checkpoint generation must be done while the VM is paused. In the
>> worst case, the size of the checkpoint can be equal in size to the amount
>> of memory in total use by the VM. In order to resume VM execution as
>> fast as possible, the checkpoint is copied consistently locally into
>> a staging area before transmission. A standard memcpy() of potentially
>> such a large amount of memory not only gets no use out of the CPU cache
>> but also potentially clogs up the CPU pipeline which would otherwise
>> be useful by other neighbor VMs on the same physical node that could be
>> scheduled for execution. To minimize the effect on neighbor VMs, we use
>> RDMA to perform a "local" memcpy(), bypassing the host processor. On
>> more recent processors, a 'beefy' enough memory bus architecture can
>> move memory just as fast (sometimes faster) as a pure-software CPU-only
>> optimized memcpy() from libc. However, on older computers, this feature
>> only gives you the benefit of lower CPU-utilization at the expense of
> Isn't there a generic kernel DMA ABI for doing this type of thing (I
> think there was at one point, people have suggested things like using
> graphics cards to do it but I don't know if it ever happened).
> The other question is, do you always need to copy - what about something
> like COWing the pages?

Excellent question! Responding in two parts:

1) The kernel ABI 'vmsplice' is what I think you're referring to. Correct
      me if I'm wrong, but vmsplice was actually designed to avoid copies
      entirely between two userspace programs to be able to move memory
      more efficiently - whereas a fault tolerant system actually *needs*
      a copy to be made.

2) Using COW: Actually, I think that's an excellent idea. I've bounced that
      around with my colleagues, but we simply didn't have the manpower
      to implement it and benchmark it. There was also some concern about
      performance: Would the writable working set of the guest be so
      active/busy that COW would not get you much benefit? I think it's
      worth a try. Patches welcome =)
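
The performance concern can be put in rough numbers with a small model.
This is only a sketch under simplifying assumptions: every page in the
checkpoint stays write-protected until it has been transmitted, the
first guest write to such a page costs exactly one copy, and the
function and parameter names are made up for illustration.

```python
def pages_copied(checkpoint_pages, writes_during_transmit):
    """Return (eager, cow) page-copy counts for one checkpoint.

    eager: pages copied up front into the staging area while the VM is paused.
    cow:   pages copied only because the guest wrote to them before they
           had been transmitted.
    """
    checkpoint = set(checkpoint_pages)
    eager = len(checkpoint)
    # Only the first write to a protected page triggers a copy.
    cow = len(checkpoint & set(writes_during_transmit))
    return eager, cow
```

If the guest rewrites most of its checkpointed pages during transmission,
the two counts converge and COW saves little beyond moving the copies out
of the paused window; if it touches only a few, COW copies far less.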

- Michael
