Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency


From: Walid Nouri <walid.nouri@gmail.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: kwolf@redhat.com, eddie.dong@intel.com,
	"Dr. David Alan Gilbert" <dgilbert@redhat.com>,
	"Michael R. Hines" <mrhines@linux.vnet.ibm.com>,
	qemu-devel@nongnu.org, Paolo Bonzini <pbonzini@redhat.com>,
	yanghy@cn.fujitsu.com
Subject: Re: [Qemu-devel] Microcheckpointing: Memory-VCPU / Disk State consistency
Date: Tue, 23 Sep 2014 18:36:42 +0200	[thread overview]
Message-ID: <5421A19A.9090201@gmail.com> (raw)
In-Reply-To: <20140918135604.GB16227@stefanha-thinkpad.redhat.com>

On 18.09.2014 at 15:56, Stefan Hajnoczi wrote:
> There is the issue of request ordering (using write cache flushes).  The
> secondary probably needs to perform requests in the same order and
> interleave cache flushes in the same way as the primary.  Otherwise a
> power failure on the secondary could leave the disk in an invalid state
> that is impossible on the primary.  So I'm just pointing out that cache
> flush operations matter, not just read/write.

To be honest, I had assumed that drive-mirror handles all
block-device-specific problems, especially the cache flush requests
needed for write ordering. So my naive approach was to use existing
functionality as a kind of black-box transport mechanism and build on
top of it. But that does not seem to be possible for the subtle, tricky
part of the problem.

This means the "block filter" on the secondary must enforce the commit
semantics. To do that, it must be able to interpret the write-ordering
semantics of a stream of write requests.
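
A minimal sketch of what such a filter could look like, written in
Python rather than QEMU's C to keep it short. Everything in it is
hypothetical (the SecondaryFilter class, the tuple-based request format,
the in-memory "disk"); it only illustrates holding back the requests of
one checkpoint epoch and applying them in the order received, flushes
included, once the checkpoint commits.

```python
class SecondaryFilter:
    """Toy model: buffer one checkpoint epoch, apply it in order on commit."""

    def __init__(self):
        self.disk = {}      # sector number -> bytes, stands in for the image
        self.flushes = 0    # number of flushes that reached the "disk"
        self.pending = []   # requests of the current, uncommitted epoch

    def receive(self, request):
        # request is ("write", sector, data) or ("flush",)
        if request[0] not in ("write", "flush"):
            raise ValueError("unknown request type: %r" % (request[0],))
        self.pending.append(request)

    def commit(self):
        # Apply in arrival order so that flushes stay interleaved
        # exactly as they arrived from the primary.
        for request in self.pending:
            if request[0] == "write":
                _, sector, data = request
                self.disk[sector] = data
            else:
                self.flushes += 1
        self.pending.clear()

    def discard(self):
        # Failover before the checkpoint committed: drop the epoch.
        self.pending.clear()
```

The obvious drawback of this variant is that all disk writes of an epoch
happen at commit time.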

>
> The second, and bigger, point is that if disk commit holds back
> checkpoint commit it could be a significant performance problem due to
> the slow nature of disks.
You are completely right. This would raise the latency for the primary.
This could be avoided by changing the proposed protocol so that the
primary writes directly to its own disk and the updates are applied
asynchronously on the secondary.

> There are fancier solutions using either a journal or snapshots that
> provide data integrity without posing a performance bottleneck during
> the commit phase.
>
> The trick is to apply write requests as they come off the wire on the
> secondary but use a journal or snapshot mechanism to enforce commit
> semantics.  That way the commit doesn't have to wait for writing out all
> the data to disk.
>
Wouldn't that mean sending some kind of protocol information along with
the modified blocks, a barrier or something like that?
Can you please explain a little more what you meant?
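
One possible reading of the journal idea, as a minimal sketch in Python.
It assumes an undo journal: writes are applied to the secondary's disk
as they arrive, the old contents are saved first, and commit only has to
drop the journal. The UndoJournalDisk class and its methods are
hypothetical and not taken from QEMU.

```python
class UndoJournalDisk:
    """Toy disk that applies writes immediately but can undo an epoch."""

    def __init__(self, nsectors, sector_size=512):
        self.sector_size = sector_size
        self.sectors = [bytes(sector_size) for _ in range(nsectors)]
        self.journal = []   # (sector, old contents) for the current epoch

    def write(self, sector, data):
        if len(data) != self.sector_size:
            raise ValueError("data must be exactly one sector")
        # Save the old contents before overwriting them.
        self.journal.append((sector, self.sectors[sector]))
        self.sectors[sector] = data

    def commit(self):
        # The data is already in place; commit just forgets the undo
        # information, so it does not wait for bulk disk writes.
        self.journal.clear()

    def rollback(self):
        # Undo in reverse order so repeated writes to the same sector
        # end up at the contents of the last committed checkpoint.
        while self.journal:
            sector, old = self.journal.pop()
            self.sectors[sector] = old
```

In this model the only protocol information needed besides the blocks
themselves is the checkpoint boundary (commit or rollback).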

> The details depend on the code and I don't remember everything well
> enough.  Anyway, my mental model is:
>
> 1. The dirty bit is set *after* the primary has completed the write.
>     See bdrv_aligned_pwritev().  Therefore you cannot use the dirty
>     bitmap to query in-flight requests, instead you have to look at
>     bs->tracked_requests.
>
> 2. The mirror block job periodically scans the dirty bitmap (when there
>     is no rate-limit set it does this with no artificial delays) and
>     writes the dirty blocks.
>
> Given that cache flush requests probably need to be tracked too, maybe
> you need MC-specific block driver on the primary to monitor and control
> I/O requests.
>
> But I haven't thought this through and it's non-trivial so we need to
> break this down more.
>
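
The two points of this mental model can be illustrated with a toy model
in Python. It is not QEMU code: the ToyBlockState class is hypothetical,
and its tracked_requests list only borrows the name of the field
mentioned above. It shows that a write which has started but not yet
completed is invisible to a scan of the dirty bitmap.

```python
class ToyBlockState:
    """Toy model of in-flight tracking versus the dirty bitmap."""

    def __init__(self):
        self.tracked_requests = []   # writes submitted but not completed
        self.dirty = set()           # sectors marked dirty

    def start_write(self, sector):
        # The request is in flight; the dirty bit is not set yet.
        self.tracked_requests.append(sector)

    def complete_write(self, sector):
        # Only after completion does the sector become dirty.
        self.tracked_requests.remove(sector)
        self.dirty.add(sector)

    def mirror_pass(self):
        # One scan of the dirty bitmap: return the sectors to copy
        # and clear their dirty bits.
        copied = sorted(self.dirty)
        self.dirty.clear()
        return copied
```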

As drive-mirror lacks this functionality, one way forward (without
changing the drive-mirror code) might be an MC-specific mechanism on the
primary. This mechanism must respect write-ordering requests (such as
forced cache flushes and Force Unit Access requests) and send the
corresponding information along with the stream of blocks to the
secondary.
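
A minimal sketch of such a mechanism, as a Python toy model. The
PrimaryTap class, the Record tuple and the list used as a channel are
all hypothetical; the point is only that writes, flushes and the FUA
flag are recorded in submission order so that the secondary can see
where the ordering points are.

```python
from collections import namedtuple

# One entry of the stream sent to the secondary (hypothetical format).
Record = namedtuple("Record", "seq kind sector data fua")


class PrimaryTap:
    """Toy tap that records guest requests in submission order."""

    def __init__(self):
        self.seq = 0
        self.stream = []   # stands in for the channel to the secondary

    def _emit(self, kind, sector=None, data=None, fua=False):
        self.stream.append(Record(self.seq, kind, sector, data, fua))
        self.seq += 1

    def write(self, sector, data, fua=False):
        # fua=True marks a Force Unit Access write.
        self._emit("write", sector, data, fua)

    def flush(self):
        # A cache flush becomes an explicit record in the stream.
        self._emit("flush")
```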

From what I have learned, I'm assuming most guest OS filesystem/block
layers follow an ordering interface based on SCSI? As requests of that
kind must be flagged in an I/O request by the guest operating system,
this should be possible. Do we have a chance to access that information
in a guest request?

If this is possible, does this information survive the journey through
the nbd-server, or must there be another communication channel like the
QEMUFile approach of “block-migration.c”?
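
For reference, the NBD protocol itself defines a flush command
(NBD_CMD_FLUSH) and a Force Unit Access command flag (NBD_CMD_FLAG_FUA),
which a server advertises through NBD_FLAG_SEND_FLUSH and
NBD_FLAG_SEND_FUA. The Python sketch below only packs a request header
to show where those bits sit on the wire; the helper function name is
hypothetical, and whether a given nbd-server build honours the flags is
not something this sketch can confirm.

```python
import struct

# Constants from the NBD protocol description.
NBD_REQUEST_MAGIC = 0x25609513
NBD_CMD_WRITE = 1
NBD_CMD_FLUSH = 3
NBD_CMD_FLAG_FUA = 1 << 0


def nbd_request_header(cmd, offset, length, handle=0, fua=False):
    """Pack a 28-byte NBD request header (all fields big-endian)."""
    flags = NBD_CMD_FLAG_FUA if fua else 0
    # magic (u32), command flags (u16), type (u16),
    # handle (u64), offset (u64), length (u32)
    return struct.pack(">IHHQQI", NBD_REQUEST_MAGIC, flags, cmd,
                       handle, offset, length)
```

A FUA write would be built with
nbd_request_header(NBD_CMD_WRITE, offset, length, fua=True), and a flush
with nbd_request_header(NBD_CMD_FLUSH, 0, 0).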

Walid
