Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency


From: Walid Nouri <walid.nouri@gmail.com>
To: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>,
	qemu-devel@nongnu.org, michael@hinespot.com, hinesmr@cn.ibm.com
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
Date: Wed, 13 Aug 2014 16:03:18 +0200
Message-ID: <53EB7026.805@gmail.com>
In-Reply-To: <53E9247F.4030909@linux.vnet.ibm.com>

Yes...
Time is a problem, and it's currently running out... ;-)

I think the first step is to reason about possible approaches and how 
they can be implemented in QEMU. The implementation can follow later :-)

Thank you for the hint about the drive-mirror feature.
I will take a look at it and surely come back with new questions :-)

I also think that disk replication is the most pressing issue for MC.
While looking for approaches to replicating block
devices I read the paper about the Remus implementation. I think MC
can take a similar approach for local disks.

Here are the main facts that I have understood:

Local disk contents are viewed as internal state of the primary and secondary.
The paper explains that, to keep the disk semantics of the
primary and to allow the primary to run speculatively, all disk state
changes are written directly to the primary's disk. In parallel they are
sent asynchronously to the secondary. The secondary keeps the pending
write requests in two disk buffers: a speculation buffer and a
write-out buffer.

After receiving the next checkpoint, the secondary copies the
speculation buffer to the write-out buffer, commits the checkpoint and
applies the write-out buffer to its local disk.

When the primary fails, the secondary must wait until the write-out buffer
has been completely written to disk before changing the execution
mode to run as primary. In this case (failure of the primary) the secondary
discards the pending disk writes in its speculation buffer. This protocol
keeps the disk state consistent with the last checkpoint.
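
To make the two-buffer handling concrete, here is a minimal Python sketch
of the secondary side as I understand it from the description above. It is
only an illustration under assumptions: the class and method names are
made up and are not taken from Remus or QEMU, the disk is modeled as a
plain dict, and all I/O is synchronous.

```python
from dataclasses import dataclass, field


@dataclass
class SecondaryDisk:
    """Toy model of the secondary's disk handling (all names hypothetical)."""
    disk: dict = field(default_factory=dict)         # sector -> data (local disk)
    speculation: list = field(default_factory=list)  # writes of the running epoch
    write_out: list = field(default_factory=list)    # writes of committed epochs

    def receive_write(self, sector, data):
        # Writes arrive asynchronously from the primary and are only buffered.
        self.speculation.append((sector, data))

    def receive_checkpoint(self):
        # Checkpoint received: move the speculative writes to the
        # write-out buffer, then apply them to the local disk.
        self.write_out.extend(self.speculation)
        self.speculation.clear()
        self.flush()

    def flush(self):
        # Apply all committed writes to the local disk in arrival order.
        for sector, data in self.write_out:
            self.disk[sector] = data
        self.write_out.clear()

    def fail_over(self):
        # Primary failed: finish the committed writes first, then drop the
        # writes that belong to the unfinished epoch.
        self.flush()
        self.speculation.clear()


sec = SecondaryDisk()
sec.receive_write(0, b"epoch-1")
sec.receive_checkpoint()          # checkpoint 1 is committed
sec.receive_write(0, b"epoch-2")  # primary crashes before checkpoint 2
sec.fail_over()
```

After the failover in this example the disk still holds the data of
checkpoint 1, because the write of the unfinished epoch was discarded.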

Remus uses the Xen-specific blktap driver. As far as I know this can’t
be used with QEMU (KVM).

I must see how drive-mirror can be used for this kind of protocol.

On 11.08.2014 22:15, Michael R. Hines wrote:
> Excellent question: QEMU does have a feature called "drive-mirror"
> in block/mirror.c that was introduced a couple of years ago. I'm not
> sure what the
> adoption rate of the feature is, but I would start with that one.
>
> There is also a second fault tolerance implementation that works a
> little differently called
> "COLO" - you may have seen those emails on the list too, but their
> method does not require a disk replication solution, if I recall correctly.

I have taken a look at COLO. They have also published a good paper about
their approach. The paper is about the Xen implementation of COLO. It’s
an interesting approach to use coarse-grained lockstepping combined
with checkpointing. From what I have understood, the possible show
stopper for general application is COLO's dependency on custom changes
to the TCP stack to make it more deterministic.

IMHO there are two points. Custom changes to the TCP stack are a no-go
for proprietary operating systems like Windows. This makes COLO
application agnostic but not operating system agnostic. The other point
is that with I/O intensive workloads COLO will tend to behave like MC.
This is my point of view, but I didn’t invest much time to understand
everything in detail.

>
> I know the time pressure that comes during a thesis, though =), so
> there's no pressure to work on it - but that is the most pressing issue
> in the implementation today. (Lack of disk replication in
> micro-checkpointing.)
>
> The MC implementation also needs to be re-based against the latest
> master - I just haven't had a chance to do it yet because some of my
> hardware has been taken away from me the last few months - will
> see if I can find some reasonable hardware soon.
>
> - Michael
>
> On 08/12/2014 01:22 AM, Walid Nouri wrote:
>> Hi,
>> I will do my best to make a contribution :-)
>>
>> Are there alternative ways of replicating local storage other than
>> DRBD that are possibly feasible?
>> Some that are built directly into QEMU?
>>
>> Walid
>>
>> On 09.08.2014 14:25, Michael R. Hines wrote:
>>> On Sat, 2014-08-09 at 14:08 +0200, Walid Nouri wrote:
>>>> Hi Michael,
>>>> how is the weather in Beijing? :-)
>>> It's terrible. Lots of pollution =(
>>>
>>>> May I ask you some questions about your MC implementation?
>>>>
>>>> Currently I'm trying to understand the general workings of the MC
>>>> protocol and possible problems that can occur so that I can discuss them
>>>> in my thesis.
>>>>
>>>> As far as I have understood, MC relies on a shared disk. Disk output of
>>>> the primary VM is written directly, network output is buffered until the
>>>> corresponding checkpoint is acknowledged.
>>>>
>>>> One problem that comes to mind is: What happens when the primary VM
>>>> writes to the disk and crashes before sending the corresponding
>>>> checkpoint?
>>>>
>>> The MC implementation itself is incomplete, today. (I need help).
>>>
>>> The Xen Remus implementation uses the DRBD system to "mirror" all disk
>>> writes to the source and destination before completing each checkpoint.
>>>
>>> The KVM (mc) implementation needs exactly the same support, but it is
>>> missing today.
>>>
>>> Until that happens, we are *required* to use root-over-iSCSI or
>>> root-over-NFS (meaning that the guest filesystem is mounted directly
>>> inside the virtual machine without the host knowing about it).
>>>
>>> This has the effect of translating all disk I/O into network I/O,
>>> and since network I/O is already buffered, then we are safe.
>>>
>>>
>>>> Here is an example: The primary state is in the current epoch (n),
>>>> the secondary state is in epoch (n-1). The primary writes to disk and
>>>> crashes before or while sending checkpoint n. In this case the
>>>> secondary memory state is still at epoch (n-1) and the state of the
>>>> shared disk corresponds to the primary state of epoch (n).
>>>>
>>>> How does MC guarantee that the disk state of the backup VM is consistent
>>>> with its memory state?
>>> As I mentioned above, we need the equivalent of the Xen solution, but I
>>> just haven't had the time to write it (or incorporate someone else's
>>> implementation). Patch is welcome =)
>>>
>>>> Is Memory-VCPU / Disk State consistency necessary under all
>>>> circumstances?
>>>> Or can this be neglected because the secondary will (after a failover)
>>>> repeat the same instructions and finally write the same data to disk
>>>> a second time (as the primary did before)?
>>>> Could this lead to fatal inconsistencies?
>>>>
>>>> Walid
>>>>
>>>
>>>
>>>
>>> - Michael
>>>
>>>
>>>
>>
>>
>
