Message-ID: <53040B77.3030008@linux.vnet.ibm.com>
Date: Wed, 19 Feb 2014 09:40:07 +0800
From: "Michael R. Hines"
References: <1392713429-18201-1-git-send-email-mrhines@linux.vnet.ibm.com> <1392713429-18201-2-git-send-email-mrhines@linux.vnet.ibm.com> <20140218124550.GF2662@work-vm>
In-Reply-To: <20140218124550.GF2662@work-vm>
Subject: Re: [Qemu-devel] [RFC PATCH v2 01/12] mc: add documentation for micro-checkpointing
To: "Dr. David Alan Gilbert"
Cc: GILR@il.ibm.com, SADEKJ@il.ibm.com, quintela@redhat.com, BIRAN@il.ibm.com, hinesmr@cn.ibm.com, qemu-devel@nongnu.org, EREZH@il.ibm.com, owasserm@redhat.com, onom@us.ibm.com, junqing.wang@cs2c.com.cn, lig.fnst@cn.fujitsu.com, gokul@us.ibm.com, dbulkow@gmail.com, pbonzini@redhat.com, abali@us.ibm.com, isaku.yamahata@gmail.com, "Michael R. Hines"

On 02/18/2014 08:45 PM, Dr. David Alan Gilbert wrote:
>> +The Micro-Checkpointing Process
>> +Basic Algorithm
>> +Micro-Checkpoints (MC) work against the existing live migration path in QEMU, and can effectively be understood as a "live migration that never ends". As such, iteration rounds happen at the granularity of 10s of milliseconds and perform the following steps:
>> +
>> +1. After N milliseconds, stop the VM.
>> +2. Generate an MC by invoking the live migration software path to identify and copy dirty memory into a local staging area inside QEMU.
>> +3. Resume the VM immediately so that it can make forward progress.
>> +4. Transmit the checkpoint to the destination.
>> +5. Repeat.
>> +Upon failure, load the contents of the last MC at the destination back into memory and run the VM normally.
> Later you talk about the memory allocation and how you grow the memory as needed
> to fit the checkpoint, have you tried going the other way and triggering the
> checkpoints sooner if they're taking too much memory?

There is a 'knob' in this patch called "mc-set-delay" which was designed to
solve exactly that problem. It allows policy or management software to make
an independent decision about what the frequency of the checkpoints should
be. I wasn't comfortable implementing policy directly inside the patch, as
that seemed less likely to be accepted by the community.
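
Roughly speaking, the loop and the delay knob fit together like this. This
is a minimal, self-contained sketch: mc_delay_ms and the stub functions are
illustrative names only, not the actual identifiers from the patch.

    /* Sketch of the MC iteration with the checkpoint frequency as a knob. */
    #include <stdio.h>
    #include <unistd.h>

    static long mc_delay_ms = 100;                 /* what mc-set-delay would adjust */

    static void vm_stop(void)      { printf("stop VM\n"); }
    static void generate_mc(void)  { printf("copy dirty pages into staging area\n"); }
    static void vm_start(void)     { printf("resume VM\n"); }
    static void transmit_mc(void)  { printf("send checkpoint to destination\n"); }

    int main(void)
    {
        for (int round = 0; round < 3; round++) {  /* in reality this never ends */
            usleep(mc_delay_ms * 1000);            /* 1. let the VM run for N ms */
            vm_stop();                             /*    then pause it           */
            generate_mc();                         /* 2. stage dirty memory      */
            vm_start();                            /* 3. resume immediately      */
            transmit_mc();                         /* 4. transmit, then repeat   */
        }
        return 0;
    }

A policy daemon watching checkpoint sizes could lower mc_delay_ms when
checkpoints grow too large, which is exactly the kind of decision the knob
leaves to management software rather than to the patch itself.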
>> +1. MC over TCP/IP: Once the socket connection breaks, we assume
>> failure. This happens very early in the loss of the latest MC not only
>> because a very large number of bytes is typically being sequenced in a
>> TCP stream but perhaps also because of the timeout in acknowledgement
>> of the receipt of a commit message by the destination.
>> +
>> +2. MC over RDMA: Since Infiniband does not provide any underlying
>> timeout mechanisms, this implementation enhances QEMU's RDMA migration
>> protocol to include a simple keep-alive. Upon the loss of multiple
>> keep-alive messages, the sender is deemed to have failed.
>> +
>> +In both cases, whether due to a failed TCP socket connection or a lost RDMA keep-alive group, either the sender or the receiver can be deemed to have failed.
>> +
>> +If the sender is deemed to have failed, the destination takes over immediately using the contents of the last checkpoint.
>> +
>> +If the destination is deemed to be lost, we perform the same action
>> as a live migration: resume the sender normally and wait for management
>> software to make a policy decision about whether or not to re-protect
>> the VM, which may involve a third party to identify a new destination
>> host to use as a backup for the VM.
> In this world what is making the decision about whether the sender/destination
> should win - how do you avoid a split brain situation where both
> VMs are running but the only thing that failed is the comms between them?
> Is there any guarantee that you'll have received knowledge of the comms
> failure before you pull the plug out and enable the corked packets to be
> sent on the sender side?

Good question in general - I'll add it to the FAQ. The patch implements a
basic 'transaction' mechanism in coordination with an outbound I/O buffer
(documented further down). With these two things in place, split brain is
not possible because the destination is not running. We don't allow the
destination to resume execution until a committed transaction has been
acknowledged by the destination, and only then do we allow any outbound
network traffic to be released to the outside world.
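
The ordering that matters is roughly the following. Again, this is a
minimal, self-contained sketch with made-up names, not the patch's actual
code: the point is simply that buffered output is only released after the
destination acknowledges the checkpoint.

    /* Sketch of the output-commit ordering described above. */
    #include <stdbool.h>
    #include <stdio.h>

    static void buffer_outbound_traffic(void)   { printf("cork guest network output\n"); }
    static void transmit_checkpoint(void)       { printf("send MC to destination\n"); }
    static bool wait_for_ack(void)              { printf("wait for commit ack\n"); return true; }
    static void release_outbound_traffic(void)  { printf("uncork buffered packets\n"); }
    static void declare_failure(void)           { printf("no ack: treat peer as failed\n"); }

    static void commit_one_checkpoint(void)
    {
        buffer_outbound_traffic();      /* nothing leaves the host yet           */
        transmit_checkpoint();
        if (wait_for_ack()) {           /* destination now holds a complete MC   */
            release_outbound_traffic(); /* safe: the outside world never sees    */
                                        /* state the backup could not reproduce  */
        } else {
            declare_failure();          /* management decides who takes over     */
        }
    }

    int main(void) { commit_one_checkpoint(); return 0; }

Because the destination stays paused until it takes over, and the source's
buffered packets stay corked until an acknowledgement arrives, a comms-only
failure leaves at most one side visible to the outside world.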
>> +RDMA is used for two different reasons:
>> +
>> +1. Checkpoint generation (RDMA-based memcpy):
>> +2. Checkpoint transmission
>> +Checkpoint generation must be done while the VM is paused. In the
>> worst case, the size of the checkpoint can be equal to the total amount
>> of memory in use by the VM. In order to resume VM execution as
>> fast as possible, the checkpoint is copied consistently locally into
>> a staging area before transmission. A standard memcpy() of potentially
>> such a large amount of memory not only gets no use out of the CPU cache
>> but also potentially clogs up the CPU pipeline which would otherwise
>> be useful to other neighbor VMs on the same physical node that could be
>> scheduled for execution. To minimize the effect on neighbor VMs, we use
>> RDMA to perform a "local" memcpy(), bypassing the host processor. On
>> more recent processors, a 'beefy' enough memory bus architecture can
>> move memory just as fast (sometimes faster) as a pure-software CPU-only
>> optimized memcpy() from libc. However, on older computers, this feature
>> only gives you the benefit of lower CPU-utilization at the expense of
> Isn't there a generic kernel DMA ABI for doing this type of thing (I
> think there was at one point, people have suggested things like using
> graphics cards to do it but I don't know if it ever happened).
> The other question is, do you always need to copy - what about something
> like COWing the pages?

Excellent question! Responding in two parts:

1) The kernel ABI 'vmsplice' is what I think you're referring to. Correct
me if I'm wrong, but vmsplice was actually designed to avoid copies
entirely, so that two userspace programs can move memory between them more
efficiently - whereas a fault-tolerant system actually *needs* a copy to
be made.

2) Using COW: Actually, I think that's an excellent idea. I've bounced that
around with my colleagues, but we simply didn't have the manpower to
implement it and benchmark it. There was also some concern about
performance: would the writable working set of the guest be so active/busy
that COW would not get you much benefit? I think it's worth a try. Patches
welcome =)

- Michael
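
To make the COW idea above a bit more concrete: fork() already provides
copy-on-write semantics for a process's memory, so a minimal, self-contained
illustration of the mechanism (purely illustrative, and not something the
patch does today) looks like this.

    /* The child keeps a COW view of memory frozen at fork() time, so the
     * parent can keep running and dirtying pages while the child transmits
     * the snapshot.  The COW cost depends on how large the writable working
     * set is, which is exactly the concern raised above. */
    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static char guest_ram[4096];                 /* stand-in for guest memory */

    int main(void)
    {
        memset(guest_ram, 'A', sizeof(guest_ram));

        pid_t pid = fork();                      /* snapshot point */
        if (pid == 0) {
            /* child: still sees the memory as it was at fork() time */
            printf("child transmits snapshot: first byte = %c\n", guest_ram[0]);
            _exit(0);
        }

        /* parent: continues dirtying memory; the kernel copies pages only
         * when they are actually written */
        memset(guest_ram, 'B', sizeof(guest_ram));
        printf("parent continues: first byte = %c\n", guest_ram[0]);

        waitpid(pid, NULL, 0);
        return 0;
    }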