From: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
To: Walid Nouri <walid.nouri@gmail.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	qemu-devel@nongnu.org, michael@hinespot.com, hinesmr@cn.ibm.com,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	Hongyang Yang <yanghy@cn.fujitsu.com>,
	Dong Eddie <eddie.dong@intel.com>,
	FNST-Gui Jianfeng <GuiJianfeng@cn.fujitsu.com>,
	wency@cn.fujitsu.com
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
Date: Thu, 11 Sep 2014 09:50:26 +0800
Message-ID: <5410FFE2.40308@linux.vnet.ibm.com>
In-Reply-To: <54107187.8040706@gmail.com>

On 09/10/2014 11:43 PM, Walid Nouri wrote:
> Hello Michael, Hello Paolo
> I have “studied” the available documentation/information and tried to
> get an idea of the QEMU live block operation possibilities.
>
> I think the MC protocol doesn’t need synchronous block device
> replication because primary and secondary VM are not synchronous. The
> state of the primary is always ahead of the state of the secondary.
> When the primary is in epoch(n) the secondary is in epoch(n-1).
>
> What MC needs is a block device agnostic, controlled and asynchronous
> approach for replicating the contents of block devices and their state
> changes to the secondary VM while the primary VM is running.
> Asynchronous block transfer is important to allow maximum performance
> for the primary VM, while keeping the secondary VM updated with state
> changes.
>
> The block device replication should be possible in two stages or modes.
>
> The first stage is the live copy of all block devices of the primary
> to the secondary. This is necessary if the secondary doesn’t have an
> existing image which is in sync with the primary at the time MC has
> started. This is not very convenient, but as far as I know there is
> currently no mechanism for a persistent dirty bitmap in QEMU.
>
> The second stage (mode) is the replication of block device state
> changes (modified blocks) to keep the image on the secondary in sync
> with the primary. The mirrored blocks must be buffered in RAM (block
> buffer) until the complete checkpoint (RAM, vCPU, device state) can be
> committed.
>
> To keep the complete system state consistent on the secondary
> system there must be a way for MC to commit/discard block
> device state changes. In normal operation the mirrored block device
> state changes (block buffer) are committed to disk when the complete
> checkpoint is committed. In case of a crash of the primary system
> while transferring a checkpoint, the data in the block buffer
> corresponding to the failed checkpoint must be discarded.
>
> The storage architecture should be “shared nothing” so that no shared 
> storage is required and primary/secondary can have separate block 
> device images.
>
> I think this can be achieved by drive-mirror and a filter block 
> driver. Another approach could be to exploit the block migration 
> functionality of live migration with a filter block driver.
>
> Drive-mirror (and live migration) do not rely on shared storage
> and allow live block device copy and incremental syncing.
>
> A block buffer can be implemented with a QEMU filter block driver. It
> should sit at the same position as the Quorum driver in the block
> driver hierarchy. With the block filter approach, MC will be
> transparent and block device agnostic.
>
> The block buffer filter must have an interface which allows MC to control
> the commits or discards of block device state changes. I have no idea
> where to put such an interface so that it conforms to QEMU coding style.
>
>
> I’m sure there are alternative and better approaches, and I’m open to
> any ideas.
>
>
> Walid
>
> On 17.08.2014 11:52, Paolo Bonzini wrote:
>> On 11/08/2014 22:15, Michael R. Hines wrote:
>>> Excellent question: QEMU does have a feature called "drive-mirror"
>>> in block/mirror.c that was introduced a couple of years ago. I'm not
>>> sure what the adoption rate of the feature is, but I would start
>>> with that one.
>>
>> block/mirror.c is asynchronous, and there's no support for communicating
>> checkpoints back to the master. However, the quorum disk driver could
>> be what you need.
>>
>> There's also a series on the mailing list that lets quorum read only
>> from the primary, so that quorum can still do replication and fault
>> tolerance, but skip fault detection.
>>
>> Paolo
>>
>>> There is also a second fault tolerance implementation that works a
>>> little differently called "COLO" - you may have seen those emails on
>>> the list too, but their method does not require a disk replication
>>> solution, if I recall correctly.
>>
>

Nice description of the problem - would you like to put this information
on the MC wiki page? (Just send an email to the list that says "request
for wiki account, please" in the subject, and they will make an account
for you.)

A drive-mirror + filter driver solution sounds like a good plan overall,
though of course the devil is in the details =)

I don't know how much time you have to spend on actual code, but a
description of what a "theoretical" interface between MC and
drive-mirror would look like would go a long way, even without code.
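
To give that discussion a concrete starting point, here is a minimal
sketch of the commit/discard semantics described in the quoted mail,
written as a plain Python model rather than QEMU code. The class name
BlockBuffer and its methods write(), commit() and discard() are
hypothetical: nothing like them exists in QEMU or drive-mirror today,
and the dict that stands in for the secondary's disk image is only an
assumption for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BlockBuffer:
    """Hypothetical secondary-side buffer for mirrored writes (not a QEMU API)."""

    # sector -> data; stands in for the secondary's on-disk image
    disk: dict = field(default_factory=dict)
    # epoch -> {sector: data}; mirrored writes held in RAM per checkpoint
    pending: dict = field(default_factory=dict)

    def write(self, epoch, sector, data):
        # Buffer a mirrored write; the disk image is not touched yet.
        self.pending.setdefault(epoch, {})[sector] = data

    def commit(self, epoch):
        # The checkpoint for this epoch (RAM, vCPU, device state) arrived
        # completely, so its buffered writes can be applied to the image.
        self.disk.update(self.pending.pop(epoch, {}))

    def discard(self, epoch):
        # The primary failed while this checkpoint was in flight, so the
        # writes belonging to it must never reach the image.
        self.pending.pop(epoch, None)


buf = BlockBuffer()
buf.write(1, 10, b"a")   # epoch 1 writes sector 10
buf.commit(1)            # checkpoint 1 complete: applied to disk
buf.write(2, 10, b"b")   # epoch 2 overwrites sector 10 ...
buf.discard(2)           # ... but the primary crashes mid-checkpoint
print(buf.disk)          # sector 10 still holds the epoch 1 data
```

Keying the buffer by epoch is a design choice of this sketch: it lets
the secondary drop exactly the writes of a failed checkpoint without
touching anything that was already committed.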

Your investigations would also help "drive" a solution to this problem
for the COLO team - I believe they need the same thing.

- Michael
