From: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
To: Walid Nouri <walid.nouri@gmail.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	qemu-devel@nongnu.org, michael@hinespot.com, hinesmr@cn.ibm.com,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Hongyang Yang <yanghy@cn.fujitsu.com>,
	Dong Eddie <eddie.dong@intel.com>,
	FNST-Gui Jianfeng <GuiJianfeng@cn.fujitsu.com>,
	wency@cn.fujitsu.com
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
Date: Thu, 11 Sep 2014 09:50:26 +0800
Message-ID: <5410FFE2.40308@linux.vnet.ibm.com>
In-Reply-To: <54107187.8040706@gmail.com>

On 09/10/2014 11:43 PM, Walid Nouri wrote:
> Hello Michael, Hello Paolo
> I have "studied" the available documentation and information and tried
> to get an idea of what QEMU's live block operations make possible.
>
> I think the MC protocol doesn’t need synchronous block device
> replication because the primary and secondary VMs are not synchronous.
> The state of the primary is always ahead of the state of the secondary:
> when the primary is in epoch(n), the secondary is in epoch(n-1).
>
> What MC needs is a block-device-agnostic, controlled and asynchronous
> approach for replicating the contents of block devices and their state
> changes to the secondary VM while the primary VM is running.
> Asynchronous block transfer is important to allow maximum performance
> for the primary VM, while keeping the secondary VM updated with state
> changes.
>
> The block device replication should be possible in two stages or modes.
>
> The first stage is the live copy of all block devices of the primary
> to the secondary. This is necessary if the secondary doesn’t have an
> existing image that is in sync with the primary at the time MC is
> started. This is not very convenient, but as far as I know there is
> currently no mechanism for persistent dirty bitmaps in QEMU.
>
> The second stage (mode) is the replication of block device state
> changes (modified blocks) to keep the image on the secondary in sync
> with the primary. The mirrored blocks must be buffered in RAM (block
> buffer) until the complete checkpoint (RAM, vCPU, device state) can be
> committed.
>
> To keep the complete system state consistent on the secondary
> system, there must be a way for MC to commit or discard block
> device state changes. In normal operation the mirrored block device
> state changes (block buffer) are committed to disk when the complete
> checkpoint is committed. If the primary system crashes while
> transferring a checkpoint, the data in the block buffer
> corresponding to the failed checkpoint must be discarded.
>
> The storage architecture should be “shared nothing” so that no shared 
> storage is required and primary/secondary can have separate block 
> device images.
>
> I think this can be achieved by drive-mirror and a filter block 
> driver. Another approach could be to exploit the block migration 
> functionality of live migration with a filter block driver.
>
> Neither drive-mirror nor live migration relies on shared storage, and
> both allow live block device copy and incremental syncing.
>
> A block buffer can be implemented with a QEMU filter block driver. It
> should sit at the same position as the Quorum driver in the block
> driver hierarchy. With the block filter approach, MC will be
> transparent and block device agnostic.
>
> The block buffer filter must have an interface that allows MC to control
> the commits or discards of block device state changes. I have no idea
> where to put such an interface so that it conforms to QEMU coding style.
>
>
> I’m sure there are alternative and better approaches, and I’m open to
> any ideas.
>
>
> Walid
>
> On 17.08.2014 11:52, Paolo Bonzini wrote:
>> On 11/08/2014 22:15, Michael R. Hines wrote:
>>> Excellent question: QEMU does have a feature called "drive-mirror"
>>> in block/mirror.c that was introduced a couple of years ago. I'm not
>>> sure what the adoption rate of the feature is, but I would start
>>> with that one.
>>
>> block/mirror.c is asynchronous, and there's no support for communicating
>> checkpoints back to the master. However, the quorum disk driver could
>> be what you need.
>>
>> There's also a series on the mailing list that lets quorum read only
>> from the primary, so that quorum can still do replication and fault
>> tolerance, but skip fault detection.
>>
>> Paolo
>>
>>> There is also a second fault tolerance implementation, called "COLO",
>>> that works a little differently - you may have seen those emails on
>>> the list too, but their method does not require a disk replication
>>> solution, if I recall correctly.
>>
>

Nice description of the problem - would you like to put this information
on the MC wiki page? (Just send an email to the list that says "request
for wiki account, please" in the subject, and they will make an account
for you.)

A drive-mirror + filter driver solution sounds like a good plan overall.
Of course, the devil is in the details =)

I don't know how much time you have to spend on actual code, but a
description of what a "theoretical" interface between MC and
drive-mirror would look like would go a long way, even without code.
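
As a starting point for that discussion, here is a minimal sketch of the
commit/discard semantics described above. Everything in it is an
assumption made for illustration: the `BlockBuffer` class and its
`mirror_write`, `commit` and `discard` methods are hypothetical names,
not QEMU or drive-mirror APIs, and a plain dict stands in for the
secondary's disk image. It only shows the intended behaviour: mirrored
writes are held per epoch, applied when the checkpoint is committed, and
dropped when the checkpoint fails.

```python
from dataclasses import dataclass, field


@dataclass
class BlockBuffer:
    """Hypothetical secondary-side block buffer (not a QEMU API)."""

    # sector number -> data; stands in for the secondary's disk image
    disk: dict = field(default_factory=dict)
    # epoch -> {sector number -> data}; writes not yet committed
    pending: dict = field(default_factory=dict)

    def mirror_write(self, epoch, sector, data):
        """Buffer a mirrored write in RAM until its checkpoint completes."""
        self.pending.setdefault(epoch, {})[sector] = data

    def commit(self, epoch):
        """Checkpoint for this epoch is complete: apply its writes to disk."""
        for sector, data in self.pending.pop(epoch, {}).items():
            self.disk[sector] = data

    def discard(self, epoch):
        """Checkpoint for this epoch failed: drop its buffered writes."""
        self.pending.pop(epoch, None)


if __name__ == "__main__":
    buf = BlockBuffer()
    buf.mirror_write(epoch=1, sector=0, data=b"old")
    buf.commit(1)                      # epoch 1 checkpoint arrived in full
    buf.mirror_write(epoch=2, sector=0, data=b"new")
    buf.discard(2)                     # primary died mid-checkpoint
    print(buf.disk)                    # sector 0 still holds the epoch 1 data
```

Keying the buffer by epoch keeps the secondary's image at the last
complete checkpoint: a write only reaches the disk together with the
RAM, vCPU and device state of the same epoch.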

Your investigations would also help "drive" a solution to this problem
for the COLO team - I believe they need the same thing.

- Michael
