[Qemu-devel] Notes on block I/O data integrity

* [Qemu-devel] Notes on block I/O data integrity
@ 2009-08-25 18:11 Christoph Hellwig
  2009-08-25 19:33 ` [Qemu-devel] " Javier Guerra
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Christoph Hellwig @ 2009-08-25 18:11 UTC (permalink / raw)
  To: qemu-devel, kvm; +Cc: rusty

As various people wanted to know how the various data integrity patches
I've sent out recently play together, here's a small writeup on what
issues we have in QEMU and how to fix them:

There are two major aspects of data integrity we need to care about in
the QEMU block I/O code:

 (1) stable data storage - we must be able to force data out of caches
     onto the stable media, and we must get completion notification for it.
 (2) request ordering - we must be able to make sure some I/O requests
     do not get reordered with other in-flight requests before or after
     them.

Linux uses two related abstractions to implement this (other operating
systems are probably similar):

 (1) a cache flush request that flushes the whole volatile write cache to
     stable storage
 (2) a barrier request, which
      (a) is guaranteed to actually go all the way to stable storage
      (b) is not reordered versus any requests before or after it

For disks not using volatile write caches the cache flush is a no-op,
and barrier requests are implemented by draining the queue of
outstanding requests before the barrier request, and only allowing new
requests to proceed after it has finished.  Instead of the queue drain,
tag ordering could be used, but at this point that is not the case in
Linux.

For disks using volatile write caches, the cache flush is implemented by
a protocol-specific request, and barrier requests are implemented
by performing cache flushes before and after the barrier request, in
addition to the draining mentioned above.  On modern disks the second
cache flush can be replaced by setting the "Force Unit Access" bit on
the barrier request.
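
The sequencing described in the two paragraphs above can be sketched as a small
planner that lists the steps issued around one barrier request.  This is only
an illustration: the enum and function names are made up for this example and
are not Linux block layer identifiers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Steps issued around one barrier request (illustrative names only). */
enum barrier_step {
    STEP_DRAIN_QUEUE,        /* wait for all earlier requests to complete */
    STEP_CACHE_FLUSH,        /* protocol-specific flush of the write cache */
    STEP_BARRIER_WRITE,      /* the barrier request itself */
    STEP_BARRIER_WRITE_FUA,  /* barrier request with Force Unit Access set */
    STEP_RELEASE_QUEUE       /* let later requests proceed */
};

/*
 * Fill 'steps' with the sequence for one barrier and return its length.
 * The longest sequence has five entries, so 'max' must be at least 5.
 */
size_t plan_barrier(bool volatile_write_cache, bool supports_fua,
                    enum barrier_step *steps, size_t max)
{
    size_t n = 0;

    assert(steps != NULL && max >= 5);

    steps[n++] = STEP_DRAIN_QUEUE;
    if (volatile_write_cache)
        steps[n++] = STEP_CACHE_FLUSH;           /* flush before the barrier */

    if (volatile_write_cache && supports_fua) {
        steps[n++] = STEP_BARRIER_WRITE_FUA;     /* FUA replaces the 2nd flush */
    } else {
        steps[n++] = STEP_BARRIER_WRITE;
        if (volatile_write_cache)
            steps[n++] = STEP_CACHE_FLUSH;       /* flush after the barrier */
    }

    steps[n++] = STEP_RELEASE_QUEUE;
    return n;
}
```

Without a volatile write cache the plan is just drain, write, release; with a
cache and no FUA support it grows to five steps with a flush on either side of
the barrier write.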


The above is supported by the QEMU emulated disks in the following way:

  - The IDE disk emulation implements the ATA WIN_FLUSH_CACHE/
    WIN_FLUSH_CACHE_EXT commands to flush the drive cache, but does not
    indicate a volatile write cache in the ATA IDENTIFY command.  Because
    of that guests do not actually send down cache flush requests.  Linux
    guests do however drain the I/O queues to guarantee ordering in the
    absence of volatile write caches.
  - The SCSI disk emulation implements the SCSI SYNCHRONIZE_CACHE command,
    and also advertises the write cache enabled bit.  This means Linux
    sends down cache flush requests to implement barriers, and provides
    sufficient queue draining.
  - The virtio-blk driver does not implement any cache flush command.
    And while there is a virtio-blk feature bit for barrier support,
    it is not supported by QEMU's virtio-blk emulation.  Due to the lack
    of a cache flush command it also is insufficient to implement the
    required data integrity semantics.  Currently the Linux virtio-blk
    driver does not advertise any form of barrier support, and we don't
    even get the queue draining required for proper operation in a
    cache-less environment.

The I/O from these front end drivers maps to different host kernel I/O
patterns depending on the cache= drive command line option.  There are
three choices for it:

 (a) cache=writethrough
 (b) cache=writeback
 (c) cache=none

(a) means all writes are synchronous (O_DSYNC), which means the host
    kernel guarantees us that data is on stable storage once the I/O
    request has completed.
    In cache=writethrough mode the IDE and SCSI drivers are safe because
    the queue is properly drained to guarantee I/O ordering.  Virtio-blk
    is not safe due to the lack of queue draining.
(b) means we use regular buffered writes and need a fsync/fdatasync to
    actually guarantee that data is stable on disk.
    In cache=writeback mode only the SCSI emulation is safe, as all
    others lack the cache flush requests.
(c) means we use direct I/O (O_DIRECT) to bypass the host cache and
    perform direct DMA to/from the I/O buffer in QEMU.  While direct I/O
    bypasses the host cache it does not guarantee flushing of volatile
    write caches in disks, nor completion of metadata operations in
    filesystems (e.g. block allocations).
    In cache=none mode only the SCSI emulation is entirely safe right now
    due to the lack of cache flushes in the other drivers.
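
As a compact restatement of (a) to (c), here is a minimal sketch that encodes
which host-side behaviour each cache= mode implies and which emulated front
ends the text above calls safe.  The type and function names are invented for
this example and do not come from the QEMU source; the booleans stand in for
the real O_DSYNC and O_DIRECT open flags to keep the sketch portable.

```c
#include <assert.h>
#include <stdbool.h>

enum frontend   { FRONTEND_IDE, FRONTEND_SCSI, FRONTEND_VIRTIO_BLK };
enum cache_mode { CACHE_WRITETHROUGH, CACHE_WRITEBACK, CACHE_NONE };

/* Host-side consequences of one cache= mode (illustrative model). */
struct cache_policy {
    bool o_dsync;      /* image opened with O_DSYNC */
    bool o_direct;     /* image opened with O_DIRECT */
    bool needs_flush;  /* stability needs an explicit fsync/fdatasync */
};

struct cache_policy cache_policy_for(enum cache_mode mode)
{
    struct cache_policy p = { false, false, false };

    switch (mode) {
    case CACHE_WRITETHROUGH:
        p.o_dsync = true;        /* every write is synchronous */
        break;
    case CACHE_WRITEBACK:
        p.needs_flush = true;    /* buffered writes, flush on demand */
        break;
    case CACHE_NONE:
        p.o_direct = true;       /* bypass the host page cache */
        p.needs_flush = true;    /* disk cache and metadata still pending */
        break;
    default:
        assert(!"unknown cache mode");
    }
    return p;
}

/* Safety matrix as stated in the text for the emulations at that time. */
bool frontend_is_safe(enum frontend fe, enum cache_mode mode)
{
    if (cache_policy_for(mode).needs_flush)
        return fe == FRONTEND_SCSI;       /* only SCSI sends cache flushes */
    return fe != FRONTEND_VIRTIO_BLK;     /* virtio-blk lacks queue draining */
}
```

For instance, frontend_is_safe(FRONTEND_IDE, CACHE_WRITEBACK) is false in this
model because the IDE emulation never receives a flush request from the guest.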


Action plan for the guest drivers:

 - virtio-blk needs to advertise an ordered queue by default.
   This makes cache=writethrough safe on virtio.

Action plan for QEMU:

 - IDE needs to set the write cache enabled bit
 - virtio needs to implement a cache flush command and advertise it
   (also needs a small change to the host driver)
 - we need to implement an aio_fsync to not stall the vcpu on cache
   flushes
 - investigate only advertising a write cache when we really have one
   to avoid the cache flush requests for cache=writethrough

Notes on disk cache flushes on Linux hosts:

 - barrier requests and cache flushes are supported by all local
   disk filesystems in popular use (btrfs, ext3, ext4, reiserfs, XFS).
   However, unlike the other filesystems, ext3 does _NOT_ enable barriers
   and cache flush requests by default.
 - currently O_SYNC writes or fsync on block device nodes do not
   flush the disk cache.
 - currently neither the filesystems nor direct access to the block
   device nodes implement flushes of the disk caches when using
   O_DIRECT|O_DSYNC or using fsync/fdatasync after an O_DIRECT request.


* [Qemu-devel] Re: Notes on block I/O data integrity
  2009-08-25 18:11 [Qemu-devel] Notes on block I/O data integrity Christoph Hellwig
@ 2009-08-25 19:33 ` Javier Guerra
  2009-08-25 19:36   ` Christoph Hellwig
  2009-08-25 20:25 ` Nikola Ciprich
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Javier Guerra @ 2009-08-25 19:33 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: rusty, qemu-devel, kvm

On Tue, Aug 25, 2009 at 1:11 PM, Christoph Hellwig<hch@lst.de> wrote:
>  - barrier requests and cache flushes are supported by all local
>   disk filesystem in popular use (btrfs, ext3, ext4, reiserfs, XFS).
>   However unlike the other filesystems ext3 does _NOT_ enable barriers
>   and cache flush requests by default.

what about LVM? I've read somewhere that it used to just eat barriers
used by XFS, making it less safe than simple partitions.

-- 
Javier


* [Qemu-devel] Re: Notes on block I/O data integrity
  2009-08-25 19:33 ` [Qemu-devel] " Javier Guerra
@ 2009-08-25 19:36   ` Christoph Hellwig
  2009-08-26 18:57     ` Jamie Lokier
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2009-08-25 19:36 UTC (permalink / raw)
  To: Javier Guerra; +Cc: rusty, Christoph Hellwig, kvm, qemu-devel

On Tue, Aug 25, 2009 at 02:33:30PM -0500, Javier Guerra wrote:
> On Tue, Aug 25, 2009 at 1:11 PM, Christoph Hellwig<hch@lst.de> wrote:
> >  - barrier requests and cache flushes are supported by all local
> >   disk filesystem in popular use (btrfs, ext3, ext4, reiserfs, XFS).
> >   However unlike the other filesystems ext3 does _NOT_ enable barriers
> >   and cache flush requests by default.
> 
> what about LVM? iv'e read somewhere that it used to just eat barriers
> used by XFS, making it less safe than simple partitions.

Oh, any additional layers open another can of worms.  On Linux, until
very recently, using LVM or software RAID meant that only disabled
write caches were safe.


* Re: [Qemu-devel] Re: Notes on block I/O data integrity
  2009-08-25 19:36   ` Christoph Hellwig
@ 2009-08-26 18:57     ` Jamie Lokier
  2009-08-26 22:17       ` Christoph Hellwig
  0 siblings, 1 reply; 13+ messages in thread
From: Jamie Lokier @ 2009-08-26 18:57 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: rusty, qemu-devel, kvm, Javier Guerra

Christoph Hellwig wrote:
> > what about LVM? iv'e read somewhere that it used to just eat barriers
> > used by XFS, making it less safe than simple partitions.
> 
> Oh, any additional layers open another by cans of worms.  On Linux until
> very recently using LVM or software raid means only disabled
> write caches are safe.

I believe that's still true except if there's more than one backing
drive, so software RAID still isn't safe.  Did that change?

But even with barriers, software RAID may have a consistency problem
if one stripe is updated and the system fails before the matching
parity stripe is updated.

I've been told that some hardware RAID implementations implement a
kind of journalling to deal with this, but Linux software RAID does not.

-- Jamie


* Re: [Qemu-devel] Re: Notes on block I/O data integrity
  2009-08-26 18:57     ` Jamie Lokier
@ 2009-08-26 22:17       ` Christoph Hellwig
  2009-08-27  9:00         ` Jamie Lokier
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2009-08-26 22:17 UTC (permalink / raw)
  To: Jamie Lokier; +Cc: rusty, Javier Guerra, Christoph Hellwig, kvm, qemu-devel

On Wed, Aug 26, 2009 at 07:57:55PM +0100, Jamie Lokier wrote:
> Christoph Hellwig wrote:
> > > what about LVM? iv'e read somewhere that it used to just eat barriers
> > > used by XFS, making it less safe than simple partitions.
> > 
> > Oh, any additional layers open another by cans of worms.  On Linux until
> > very recently using LVM or software raid means only disabled
> > write caches are safe.
> 
> I believe that's still true except if there's more than one backing
> drive, so software RAID still isn't safe.  Did that change?

Yes, it did change.  That being said, with the amount of bugs in
filesystems related to write barriers my expectation for the RAID
and device mapper code is not too high.  I recommend to keep
doing what people caring for their data have been doing since these
volatile write caches came up:  turn them off.


* Re: [Qemu-devel] Re: Notes on block I/O data integrity
  2009-08-26 22:17       ` Christoph Hellwig
@ 2009-08-27  9:00         ` Jamie Lokier
  0 siblings, 0 replies; 13+ messages in thread
From: Jamie Lokier @ 2009-08-27  9:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: rusty, qemu-devel, kvm, Javier Guerra

Christoph Hellwig wrote:
> On Wed, Aug 26, 2009 at 07:57:55PM +0100, Jamie Lokier wrote:
> > Christoph Hellwig wrote:
> > > > what about LVM? iv'e read somewhere that it used to just eat barriers
> > > > used by XFS, making it less safe than simple partitions.
> > > 
> > > Oh, any additional layers open another by cans of worms.  On Linux until
> > > very recently using LVM or software raid means only disabled
> > > write caches are safe.
> > 
> > I believe that's still true except if there's more than one backing
> > drive, so software RAID still isn't safe.  Did that change?
> 
> Yes, it did change. 

> I will recommend to keep doing what people caring for their data
> have been doing since these volatile write caches came up: turn them
> off.

Unfortunately I tried that on a batch of 1000 or so embedded thingies
with ext3, and the write performance plummeted.  They are the same
thingies where I observed lack of barriers resulting in filesystem
corruption after power failure.  We really need barriers with ATA
disks to get decent write performance.

It's a good recommendation though.

> That being said with the amount of bugs in filesystems related to
> write barriers my expectation for the RAID and device mapper code is
> not too high.

Turning off volatile write cache does not provide commit integrity
with RAID.

RAID needs barriers to plug, drain and unplug the queues across all
backing devices in a coordinated manner quite apart from the volatile
write cache.  And then there's still that pesky problem of writes
which reach one disk and not its parity disk.

Unfortunately turning off the volatile write caches could actually
make the timing window for failure worse, in the case of system crash
without power failure.

-- Jamie


* [Qemu-devel] Re: Notes on block I/O data integrity
  2009-08-25 18:11 [Qemu-devel] Notes on block I/O data integrity Christoph Hellwig
  2009-08-25 19:33 ` [Qemu-devel] " Javier Guerra
@ 2009-08-25 20:25 ` Nikola Ciprich
  2009-08-26 18:55   ` Jamie Lokier
  2009-08-27  0:15   ` Christoph Hellwig
  2009-08-27 10:51 ` Rusty Russell
  2009-08-27 14:09 ` [Qemu-devel] " Jamie Lokier
  3 siblings, 2 replies; 13+ messages in thread
From: Nikola Ciprich @ 2009-08-25 20:25 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: nikola.ciprich, kopi, rusty, qemu-devel, kvm

Hello Christoph,

thanks a lot for this overview, it answers a lot of my questions!
May I suggest You put it somewhere on the wiki so it doesn't get
forgotten in the mailing list only?

It also raises a few new questions though. We have experienced postgresql
database corruptions lately, two times to be exact. The first time, I blamed
a server crash, but lately a (freshly created) database got corrupted for the
second time and there were no crashes since the initialisation. The server
hardware is surely OK. I didn't have much time to look into this
yet, but Your mail just poked me to return to the subject. The situation
is a bit more complex, as there are two additional layers of storage there:
we're using SATA/SAS drives, network-mirrored by DRBD, clustered LVM on top
of those, and finally qemu-kvm using virtio on top of the created logical
volumes. So there are plenty of possible culprits, but Your mention of virtio
being unsafe while using cache=writethrough (which is the default for drive
types other than qcow) leads me to suspect that this might be the reason for
the problem. Databases are sensitive to request reordering, so I guess
using virtio for postgres storage was quite stupid of me :(

So my question is, could You please advise me a bit on the storage
configuration? virtio performed much better than SCSI, but of course
data integrity is crucial, so would You suggest using SCSI instead?
DRBD doesn't have a problem with barriers, and clustered LVM SHOULD not
have problems with them, as we're using just striped volumes, but I'll
check it to be sure. So is it safe for me to keep cache=writethrough
for the database volume?

thanks a lot in advance for any hints!

with best regards

nik

-- 
-------------------------------------
Nikola CIPRICH
LinuxBox.cz, s.r.o.
28. rijna 168, 709 01 Ostrava

tel.:   +420 596 603 142
fax:    +420 596 621 273
mobil:  +420 777 093 799

www.linuxbox.cz

mobil servis: +420 737 238 656
email servis: servis@linuxbox.cz
-------------------------------------


* Re: [Qemu-devel] Re: Notes on block I/O data integrity
  2009-08-25 20:25 ` Nikola Ciprich
@ 2009-08-26 18:55   ` Jamie Lokier
  2009-08-27  0:15   ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread
From: Jamie Lokier @ 2009-08-26 18:55 UTC (permalink / raw)
  To: Nikola Ciprich
  Cc: kopi, kvm, rusty, qemu-devel, nikola.ciprich, Christoph Hellwig

Nikola Ciprich wrote:
> clustered LVM SHOULD not have problems with it, as we're using just
> striped volumes,

Note that LVM does not implement barriers at all, except for simple
cases of a single backing device (I'm not sure if that includes
dm-crypt).

So your striped volumes may not offer this level of integrity.

-- Jamie


* [Qemu-devel] Re: Notes on block I/O data integrity
  2009-08-25 20:25 ` Nikola Ciprich
  2009-08-26 18:55   ` Jamie Lokier
@ 2009-08-27  0:15   ` Christoph Hellwig
  1 sibling, 0 replies; 13+ messages in thread
From: Christoph Hellwig @ 2009-08-27  0:15 UTC (permalink / raw)
  To: Nikola Ciprich
  Cc: kopi, kvm, rusty, qemu-devel, nikola.ciprich, Christoph Hellwig

[-- Attachment #1: Type: text/plain, Size: 2238 bytes --]

On Tue, Aug 25, 2009 at 10:25:08PM +0200, Nikola Ciprich wrote:
> Hello Christopher,
> 
> thanks a lot vor this overview, it answers a lot of my questions!
> May I suggest You put it somewhere on the wiki so it doesn't get 
> forgotten in the maillist only?

I'll rather try to get the worst issues fixed ASAP.

> It also rises few new questions though. We have experienced postgresql
> database corruptions lately, two times to be exact. First time, I blamed
> server crash, but lately (freshly created) database got corrupted for the 
> second time and there were no crashes since the initialisation. The server
> hardware is surely OK. I didn't have much time to look into this
> yet, but Your mail just poked me to return to the subject. The situation
> is a bit more complex, as there are additional two layers of storage there:
> we're using SATA/SAS drives, network-mirrored by DRBD, clustered LVM on top
> of those, and finally qemu-kvm using virtio on top of created logical
> volumes. So there are plenty of possible culprits, but Your mention of virtio
> unsafeness while using cache=writethrough (which is the default for drive 
> types other then qcow) leads me to suspicion that this might be the reason of 
> the problem. Databases are sensitive for requests reordering, so I guess
> using virtio for postgres storage was quite stupid from me :(
> So my question is, could You please advise me a bit on the storage
> configuration? virtio performed much better then SCSI, but of course
> data integrity is crucial, so would You suggest rather using SCSI?
> DRBD doesn't have problem with barriers, clustered LVM SHOULD not 
> have problems with it, as we're using just striped volumes, but I'll
> check it to be sure. So is it safe for me to keep cache=writethrough
> for the database volume?

I'm pretty sure one of the many layers in your setup will not pass
through write barriers, so definitely make sure your write caches are
disabled.  Also right now virtio is not a good idea for data integrity.
The guest side fix for a setup with cache=writethrough or cache=none
on a block device without a volatile disk write cache is however a trivial
one line patch I've already submitted.  I've attached it below for
reference:


[-- Attachment #2: virtio-blk-drain --]
[-- Type: text/plain, Size: 1896 bytes --]

Subject: [PATCH] virtio-blk: set QUEUE_ORDERED_DRAIN by default
From: Christoph Hellwig <hch@lst.de>

Currently virtio-blk doesn't set any QUEUE_ORDERED_ flag by default, which
means it does not allow filesystems to use barriers.  But the typical use
case for virtio-blk is to use a backend that uses synchronous I/O, and in
that case we can simply set QUEUE_ORDERED_DRAIN to make the block layer
drain the request queue around barrier I/O and provide the semantics that
the filesystems need.  This is what the SCSI disk driver does for disks
that have the write cache disabled.

With this patch we incorrectly advertise barrier support if someone
configures qemu with write back caching.  While this displays wrong
information in the guest there is nothing that guest could have done
even if we rightfully told it that we do not support any barriers.


Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/drivers/block/virtio_blk.c
===================================================================
--- linux-2.6.orig/drivers/block/virtio_blk.c	2009-08-20 17:41:37.019718433 -0300
+++ linux-2.6/drivers/block/virtio_blk.c	2009-08-20 17:45:40.511747922 -0300
@@ -336,9 +336,16 @@ static int __devinit virtblk_probe(struc
 	vblk->disk->driverfs_dev = &vdev->dev;
 	index++;
 
-	/* If barriers are supported, tell block layer that queue is ordered */
+	/*
+	 * If barriers are supported, tell block layer that queue is ordered.
+	 *
+	 * If no barriers are supported assume the host uses synchronous
+	 * writes and just drain the queue before and after the barrier.
+	 */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_BARRIER))
 		blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_TAG, NULL);
+	else
+		blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_DRAIN, NULL);
 
 	/* If disk is read-only in the host, the guest should obey */
 	if (virtio_has_feature(vdev, VIRTIO_BLK_F_RO))


* [Qemu-devel] Re: Notes on block I/O data integrity
  2009-08-25 18:11 [Qemu-devel] Notes on block I/O data integrity Christoph Hellwig
  2009-08-25 19:33 ` [Qemu-devel] " Javier Guerra
  2009-08-25 20:25 ` Nikola Ciprich
@ 2009-08-27 10:51 ` Rusty Russell
  2009-08-27 13:42   ` Christoph Hellwig
  2009-08-27 14:09 ` [Qemu-devel] " Jamie Lokier
  3 siblings, 1 reply; 13+ messages in thread
From: Rusty Russell @ 2009-08-27 10:51 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: qemu-devel, kvm

On Wed, 26 Aug 2009 03:41:20 am Christoph Hellwig wrote:
> As various people wanted to know how the various data integrity patches
> I've send out recently play together here's a small writeup on what
> issues we have in QEMU and how to fix it:

Classic mail.  Thanks for the massive and coherent clue injection!

> Action plan for the guest drivers:
> 
>  - virtio-blk needs to advertise ordered queue by default.
>    This makes cache=writethrough safe on virtio.

From a guest POV, that's "we don't know, let's say we're ordered because that
may make us safer".  Of course, it may not help: how much does it cost to
drain the queue?

The bug, IMHO is that we *should* know.  And in future I'd like to fix that,
either by adding a VIRTIO_BLK_F_ORDERED feature, or a VIRTIO_BLK_F_UNORDERED
feature.

> Action plan for QEMU:
> 
>  - IDE needs to set the write cache enabled bit
>  - virtio needs to implement a cache flush command and advertise it
>    (also needs a small change to the host driver)

So, virtio-blk needs to be enhanced for this as well.

Thanks!
Rusty.


* [Qemu-devel] Re: Notes on block I/O data integrity
  2009-08-27 10:51 ` Rusty Russell
@ 2009-08-27 13:42   ` Christoph Hellwig
  2009-08-28  2:03     ` Rusty Russell
  0 siblings, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2009-08-27 13:42 UTC (permalink / raw)
  To: Rusty Russell; +Cc: Christoph Hellwig, kvm, qemu-devel

On Thu, Aug 27, 2009 at 08:21:55PM +0930, Rusty Russell wrote:
> >  - virtio-blk needs to advertise ordered queue by default.
> >    This makes cache=writethrough safe on virtio.
> 
> > From a guest POV, that's "we don't know, let's say we're ordered because that
> may make us safer".  Of course, it may not help: how much does it cost to
> drain the queue?
> 
> The bug, IMHO is that we *should* know.  And in future I'd like to fix that,
> either by adding an VIRTIO_BLK_F_ORDERED feature, or a VIRTIO_BLK_F_UNORDERED
> feature.
> 
> > Action plan for QEMU:
> > 
> >  - IDE needs to set the write cache enabled bit
> >  - virtio needs to implement a cache flush command and advertise it
> >    (also needs a small change to the host driver)
> 
> So, virtio-blk needs to be enhanced for this as well.

Really, enabling volatile write caches without advertising a cache flush
command is a bug in the storage, where in our case qemu is the storage.
So I don't really see the need for two feature bits.  Here's my plan for
virtio-blk:


 - add a new VIRTIO_BLK_F_WCACHE feature.  If this feature is set we
   do
     (a) implement the prepare_flush queue operation to send a
         standalone cache flush
     (b) set a proper barrier ordering flag on the queue

	Now I'm not entirely sure which queue ordering feature we will
	use.  It is not going to be QUEUE_ORDERED_TAG as for
	VIRTIO_BLK_F_BARRIER, as that leaves all the queue draining to
	the host.  For everything that uses something resembling
	POSIX I/O as a backend and has more than one outstanding command
	at a time, that just means duplicating all the queue management
	we already do in the guest for no gain.
	The easiest one would be QUEUE_ORDERED_DRAIN_FLUSH, in which
	case the cache flush command really is everything we need.
	As a slight optimization of it we could make it
	QUEUE_ORDERED_DRAIN_FUA which still does all the queue draining
	in the guest, but only sends one explicit cache flush before the
	barrier and then sets the FUA bit on the actual barrier
	request.  In qemu we still would implement this as fdatasync
	before and after the request, but we would save one protocol
	roundtrip.
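
A minimal host-side sketch of "fdatasync before and after the request",
assuming a POSIX file descriptor as the backend and a guest that has already
drained its queue.  The function names are hypothetical and this is not code
from qemu.  Note also the caveat from the start of this thread: on Linux hosts
at the time, fdatasync did not necessarily flush the disk's own write cache.

```c
#define _POSIX_C_SOURCE 200809L   /* for pwrite() and fdatasync() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all of buf at offset off, retrying on EINTR and short writes. */
static int write_all(int fd, const char *buf, size_t len, off_t off)
{
    while (len > 0) {
        ssize_t n = pwrite(fd, buf, len, off);

        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        buf += n;
        off += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Barrier write with FUA semantics: flush, write, flush. */
int host_barrier_write(int fd, const void *buf, size_t len, off_t off)
{
    assert(buf != NULL || len == 0);

    if (fdatasync(fd) != 0)                /* earlier writes become stable */
        return -1;
    if (write_all(fd, buf, len, off) != 0)
        return -1;
    return fdatasync(fd) != 0 ? -1 : 0;    /* the barrier data becomes stable */
}

/* Usage example: store one record at offset 0 of 'path' with barrier rules. */
int commit_record(const char *path, const void *rec, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    int rc;

    if (fd < 0)
        return -1;
    rc = host_barrier_write(fd, rec, len, 0);
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

The protocol roundtrip is saved on the guest/host boundary only: the host still
issues two fdatasync calls, but the guest sends one flush command instead of
two.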

Now the big question is when do we set the VIRTIO_BLK_F_WCACHE feature.
The proper thing to do would be to set it for cache=writeback and
cache=none, because they do need the fdatasync, and not for
cache=writethrough because it does not require it.

Now Avi is a big advocate of the idea that cache=writethrough should mean
go fast and loose and don't care about data integrity.  There's a certain
point to that as I don't really see a good use case for that mode, but I
really hate to make something unsafe that doesn't explicitly say so
in the option name.

The complex (not to say over-engineered) version would be to split
the caching and data integrity setting into two options:


 (1) hostcache=on|off
 	use buffered vs O_DIRECT I/O
 (2) integrity=osync|fsync|none
 	use O_SYNC, use f(data)sync or do not care about data integrity
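
To make the proposed split concrete, here is a small sketch of how the two
options could be decoded into independent host-side decisions.  The option
names come from the proposal above; the struct, enum, and function names are
invented for this example, and nothing like this exists in qemu as far as this
thread shows.

```c
#include <assert.h>
#include <stdbool.h>

/* Values of the proposed integrity= option. */
enum integrity_mode { INTEGRITY_OSYNC, INTEGRITY_FSYNC, INTEGRITY_NONE };

/* Host-side decisions derived from the two options (illustrative model). */
struct io_policy {
    bool use_o_direct;    /* hostcache=off: open the image with O_DIRECT */
    bool use_o_sync;      /* integrity=osync: open the image with O_SYNC */
    bool sync_on_flush;   /* integrity=fsync: f(data)sync on guest flushes */
};

struct io_policy io_policy_from_options(bool hostcache,
                                        enum integrity_mode integrity)
{
    struct io_policy p = { false, false, false };

    p.use_o_direct = !hostcache;

    switch (integrity) {
    case INTEGRITY_OSYNC:
        p.use_o_sync = true;
        break;
    case INTEGRITY_FSYNC:
        p.sync_on_flush = true;
        break;
    case INTEGRITY_NONE:
        break;                 /* explicitly unsafe: never sync */
    default:
        assert(!"unknown integrity mode");
    }
    return p;
}
```

The point of the split is visible in the types: caching and integrity no longer
share one knob, so an unsafe setup has to be requested by name with
integrity=none.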


* [Qemu-devel] Re: Notes on block I/O data integrity
  2009-08-27 13:42   ` Christoph Hellwig
@ 2009-08-28  2:03     ` Rusty Russell
  0 siblings, 0 replies; 13+ messages in thread
From: Rusty Russell @ 2009-08-28  2:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: qemu-devel, kvm

On Thu, 27 Aug 2009 11:12:39 pm Christoph Hellwig wrote:
> On Thu, Aug 27, 2009 at 08:21:55PM +0930, Rusty Russell wrote:
> > >  - virtio-blk needs to advertise ordered queue by default.
> > >    This makes cache=writethrough safe on virtio.
> > 
> > > From a guest POV, that's "we don't know, let's say we're ordered because that
> > may make us safer".  Of course, it may not help: how much does it cost to
> > drain the queue?
> > 
> > The bug, IMHO is that we *should* know.  And in future I'd like to fix that,
> > either by adding an VIRTIO_BLK_F_ORDERED feature, or a VIRTIO_BLK_F_UNORDERED
> > feature.
> > 
> > > Action plan for QEMU:
> > > 
> > >  - IDE needs to set the write cache enabled bit
> > >  - virtio needs to implement a cache flush command and advertise it
> > >    (also needs a small change to the host driver)
> > 
> > So, virtio-blk needs to be enhanced for this as well.
> 
> Really, enabling volatile write caches without advertising a cache flush
> command is a bug in the storage, where in our case qemu is the storage.
> So I don't really see the need for two feature bits.  Here's my plan for
> virtio-blk:
> 
>  - add a new VIRTIO_BLK_F_WCACHE feature.  If this feature is set we
>    do
>      (a) implement the prepare_flush queue operation to send a
>          standalone cache flush
>      (b) set a proper barrier ordering flag on the queue

OK, I buy that.  I'll update the virtio_pci spec accordingly, too.

I've applied your previous patch.

> The complex (not to say over engineered) version would be to split
> the caching and data integrity setting into two options:
> 
>  (1) hostcache=on|off
>  	use buffered vs O_DIRECT I/O
>  (2) integrity=osync|fsync|none
>  	use O_SYNC, use f(data)sync or do not care about data integrity

If we were starting from scratch, I'd agree.  But seems like too much
user-visible churn.

Thanks,
Rusty.


* Re: [Qemu-devel] Notes on block I/O data integrity
  2009-08-25 18:11 [Qemu-devel] Notes on block I/O data integrity Christoph Hellwig
                   ` (2 preceding siblings ...)
  2009-08-27 10:51 ` Rusty Russell
@ 2009-08-27 14:09 ` Jamie Lokier
  3 siblings, 0 replies; 13+ messages in thread
From: Jamie Lokier @ 2009-08-27 14:09 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: rusty, qemu-devel, kvm

Christoph Hellwig wrote:
> As various people wanted to know how the various data integrity patches
> I've send out recently play together here's a small writeup on what
> issues we have in QEMU and how to fix it:

Thanks for taking this on.  Both this email and the one on
linux-fsdevel about Linux behaviour are wonderfully clear summaries of
the issues.

> Action plan for QEMU:
>
>  - IDE needs to set the write cache enabled bit
>  - virtio needs to implement a cache flush command and advertise it
>    (also needs a small change to the host driver)

With IDE and SCSI, and perhaps virtio-blk, guests should also be able
to disable the "write cache enabled" bit, and that should be
equivalent to the guest issuing a cache flush command after every
write.

At the host it could be implemented as if every write were followed by
flush, or by switching to O_DSYNC (cache=writethrough) in response.

The other way around: for guests where integrity isn't required
(e.g. disposable guests for testing - or speed during guest OS
installs), you might want an option to ignore cache flush commands -
just let the guest *think* it's committing to disk, but don't waste
time doing that on the host.
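
Both behaviours could be sketched on the host roughly as follows.  This
is only an illustration under assumed names: struct drive and its wce
and ignore_flush fields are hypothetical, not existing QEMU state.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-drive state. */
struct drive {
    int fd;
    bool wce;           /* guest-visible "write cache enabled" bit */
    bool ignore_flush;  /* disposable guest: flushes are no-ops */
};

/* Guest cache flush command. */
int drive_flush(struct drive *d)
{
    if (d->ignore_flush)
        return 0;       /* pretend the data is on disk */
    return fdatasync(d->fd);
}

/* Guest write; returns 0 on success, -1 on error or short write. */
int drive_write(struct drive *d, const void *buf, size_t len, off_t off)
{
    ssize_t n = pwrite(d->fd, buf, len, off);

    if (n < 0 || (size_t)n != len)
        return -1;
    /* Write cache disabled: behave as if a flush followed the write. */
    if (!d->wce)
        return drive_flush(d);
    return 0;
}
```

Switching the file to O_DSYNC instead of flushing after each write
would give the guest the same guarantee.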

> For disks using volatile write caches, the cache flush is implemented by
> a protocol specific request, and the the barrier request are implemented
> by performing cache flushes before and after the barrier request, in
> addition to the draining mentioned above.  The second cache flush can be
> replaced by setting the "Force Unit Access" bit on the barrier request 
> on modern disks.

For fdatasync (etc), you've probably noticed that it only needs one
cache flush by itself, no second request or FUA write.

Less obviously, there are opportunities to merge and reorder around
non-barrier flush requests in the elevator, and to eliminate redundant
flush requests.

Also you don't need flushes to reach every backing drive on RAID, but
knowing which ones to leave out is tricky and needs more hints from
the filesystem.

I agree with the whole of your general plan, both in QEMU and in Linux
as a host.  Spot on!

-- Jamie


Thread overview: 13+ messages
2009-08-25 18:11 [Qemu-devel] Notes on block I/O data integrity Christoph Hellwig
2009-08-25 19:33 ` [Qemu-devel] " Javier Guerra
2009-08-25 19:36   ` Christoph Hellwig
2009-08-26 18:57     ` Jamie Lokier
2009-08-26 22:17       ` Christoph Hellwig
2009-08-27  9:00         ` Jamie Lokier
2009-08-25 20:25 ` Nikola Ciprich
2009-08-26 18:55   ` Jamie Lokier
2009-08-27  0:15   ` Christoph Hellwig
2009-08-27 10:51 ` Rusty Russell
2009-08-27 13:42   ` Christoph Hellwig
2009-08-28  2:03     ` Rusty Russell
2009-08-27 14:09 ` [Qemu-devel] " Jamie Lokier
