qemu-devel.nongnu.org archive mirror
* [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
@ 2006-07-28 19:54 Rik van Riel
  2006-07-28 19:58 ` [Qemu-devel] " Rik van Riel
                   ` (2 more replies)
  0 siblings, 3 replies; 23+ messages in thread
From: Rik van Riel @ 2006-07-28 19:54 UTC (permalink / raw)
  To: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 718 bytes --]

This is the simple approach to making sure that disk writes actually
hit disk before we tell the guest OS that IO has completed.  Thanks
to DMA_MULTI_THREAD the performance still seems to be adequate.

A fancier solution would be to make the sync/non-sync behaviour of
the qemu disk backing store tunable from the guest OS, by tuning
the IDE disk write cache on/off with hdparm, and having hw/ide.c
call ->fsync functions in the block backends.

I'm willing to code up the fancy solution if people prefer that.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

[-- Attachment #2: xen-hvm-osync.patch --]
[-- Type: text/x-patch, Size: 3160 bytes --]

Make sure disk writes really made it to disk before we report I/O
completion to the guest domain.  The DMA_MULTI_THREAD functionality
from the qemu-dm IDE emulation should make the performance overhead
of synchronous writes bearable, or at least comparable to native
hardware.

Signed-off-by: Rik van Riel <riel@redhat.com>

--- xen-unstable-10712/tools/ioemu/block-bochs.c.osync	2006-07-28 02:15:56.000000000 -0400
+++ xen-unstable-10712/tools/ioemu/block-bochs.c	2006-07-28 02:21:08.000000000 -0400
@@ -91,7 +91,7 @@
     int fd, i;
     struct bochs_header bochs;
 
-    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
+    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
     if (fd < 0) {
         fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
         if (fd < 0)
--- xen-unstable-10712/tools/ioemu/block.c.osync	2006-07-28 02:15:56.000000000 -0400
+++ xen-unstable-10712/tools/ioemu/block.c	2006-07-28 02:19:27.000000000 -0400
@@ -677,7 +677,7 @@
     int rv;
 #endif
 
-    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
+    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
     if (fd < 0) {
         fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
         if (fd < 0)
--- xen-unstable-10712/tools/ioemu/block-cloop.c.osync	2006-07-28 02:15:56.000000000 -0400
+++ xen-unstable-10712/tools/ioemu/block-cloop.c	2006-07-28 02:17:13.000000000 -0400
@@ -55,7 +55,7 @@
     BDRVCloopState *s = bs->opaque;
     uint32_t offsets_size,max_compressed_block_size=1,i;
 
-    s->fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
+    s->fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE | O_SYNC);
     if (s->fd < 0)
         return -1;
     bs->read_only = 1;
--- xen-unstable-10712/tools/ioemu/block-cow.c.osync	2006-07-28 02:15:56.000000000 -0400
+++ xen-unstable-10712/tools/ioemu/block-cow.c	2006-07-28 02:21:34.000000000 -0400
@@ -69,7 +69,7 @@
     struct cow_header_v2 cow_header;
     int64_t size;
 
-    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
+    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
     if (fd < 0) {
         fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
         if (fd < 0)
--- xen-unstable-10712/tools/ioemu/block-qcow.c.osync	2006-07-28 02:15:56.000000000 -0400
+++ xen-unstable-10712/tools/ioemu/block-qcow.c	2006-07-28 02:20:05.000000000 -0400
@@ -95,7 +95,7 @@
     int fd, len, i, shift;
     QCowHeader header;
     
-    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
+    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
     if (fd < 0) {
         fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
         if (fd < 0)
--- xen-unstable-10712/tools/ioemu/block-vmdk.c.osync	2006-07-28 02:15:56.000000000 -0400
+++ xen-unstable-10712/tools/ioemu/block-vmdk.c	2006-07-28 02:20:20.000000000 -0400
@@ -96,7 +96,7 @@
     uint32_t magic;
     int l1_size;
 
-    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE);
+    fd = open(filename, O_RDWR | O_BINARY | O_LARGEFILE | O_SYNC);
     if (fd < 0) {
         fd = open(filename, O_RDONLY | O_BINARY | O_LARGEFILE);
         if (fd < 0)
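
The pattern each hunk above repeats (try a read-write open with O_SYNC, then
fall back to a plain read-only open) can be sketched in isolation as follows.
This is only an illustration: open_image() and its sync_writes/read_only
parameters are hypothetical names that do not appear in the patch, and
O_BINARY and O_LARGEFILE are left out to keep the sketch portable.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not part of the patch: open a disk image read-write,
 * optionally with O_SYNC, and fall back to read-only the way the hunks do. */
int open_image(const char *filename, int sync_writes, int *read_only)
{
    int flags = O_RDWR;
    int fd;

    assert(filename != NULL && read_only != NULL);

    if (sync_writes)
        flags |= O_SYNC;    /* write() returns only once the data is on stable storage */

    *read_only = 0;
    fd = open(filename, flags);
    if (fd < 0) {
        /* Read-write open failed: retry read-only. O_SYNC is not needed
         * here, since nothing will be written through this descriptor. */
        fd = open(filename, O_RDONLY);
        if (fd < 0)
            return -1;
        *read_only = 1;
    }
    return fd;
}
```

Passing sync_writes as a parameter rather than hard-coding O_SYNC is an
assumption of the sketch; it leaves room for the tunable behaviour described
in the cover letter.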

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Qemu-devel] Re: [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-28 19:54 [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk Rik van Riel
@ 2006-07-28 19:58 ` Rik van Riel
  2006-07-28 20:12 ` Anthony Liguori
  2006-07-29  9:57 ` [Qemu-devel] " Fabrice Bellard
  2 siblings, 0 replies; 23+ messages in thread
From: Rik van Riel @ 2006-07-28 19:58 UTC (permalink / raw)
  To: qemu-devel

Rik van Riel wrote:
> This is the simple approach to making sure that disk writes actually
> hit disk before we tell the guest OS that IO has completed.  Thanks
> to DMA_MULTI_THREAD the performance still seems to be adequate.

Hah, and of course that bit is only found in Xen's qemu-dm. Doh!

I knew I should have also checked some of the files my patch didn't
touch :)

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* [Qemu-devel] Re: [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-28 19:54 [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk Rik van Riel
  2006-07-28 19:58 ` [Qemu-devel] " Rik van Riel
@ 2006-07-28 20:12 ` Anthony Liguori
  2006-07-28 20:18   ` Rik van Riel
  2006-07-29  9:57 ` [Qemu-devel] " Fabrice Bellard
  2 siblings, 1 reply; 23+ messages in thread
From: Anthony Liguori @ 2006-07-28 20:12 UTC (permalink / raw)
  To: qemu-devel

On Fri, 28 Jul 2006 15:54:30 -0400, Rik van Riel wrote:

> This is the simple approach to making sure that disk writes actually hit
> disk before we tell the guest OS that IO has completed.  Thanks to
> DMA_MULTI_THREAD the performance still seems to be adequate.

Hi Rik,

Right now Fabrice is working on rewriting the block API to be
asynchronous.  There's been quite a lot of discussion about why using
threads isn't a good idea for this (I wish Xen wouldn't use this patch but
that's another conversation :-)).

The async block API will allow the use of different kinds of async
"backends".  The default (on Linux) will be posix-aio.  I'm currently
working on an HTTP backend and will also write a linux-aio (which, of
course, will be using O_DIRECT).

> A fancier solution would be to make the sync/non-sync behaviour of the
> qemu disk backing store tunable from the guest OS, by tuning the IDE disk
> write cache on/off with hdparm, and having hw/ide.c call ->fsync functions
> in the block backends.

With a proper async API, is there any reason why we would want this to be
tunable?  I don't think there's much of a benefit of prematurely claiming
a write is complete especially once the SCSI emulation can support
multiple simultaneous requests.

I was hoping to just make linux-aio the default if it was available...

Regards,

Anthony Liguori

> I'm willing to code up the fancy solution if people prefer that.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] Re: [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-28 20:12 ` Anthony Liguori
@ 2006-07-28 20:18   ` Rik van Riel
  2006-07-28 20:30     ` Paul Brook
  2006-07-31  7:08     ` Jens Axboe
  0 siblings, 2 replies; 23+ messages in thread
From: Rik van Riel @ 2006-07-28 20:18 UTC (permalink / raw)
  To: qemu-devel; +Cc: alan

Anthony Liguori wrote:

> Right now Fabrice is working on rewriting the block API to be
> asynchronous.  There's been quite a lot of discussion about why using
> threads isn't a good idea for this

Agreed, AIO is the way to go in the long run.

> With a proper async API, is there any reason why we would want this to be
> tunable?  I don't think there's much of a benefit of prematurely claiming
> a write is complete especially once the SCSI emulation can support
> multiple simultaneous requests.

You're right.  This O_SYNC bandaid should probably stay in place
to prevent data corruption, until the AIO framework is ready to
be used.

No sense investing too much time in a fancier band-aid.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] Re: [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-28 20:18   ` Rik van Riel
@ 2006-07-28 20:30     ` Paul Brook
  2006-07-28 20:43       ` Rik van Riel
  2006-07-31  7:08     ` Jens Axboe
  1 sibling, 1 reply; 23+ messages in thread
From: Paul Brook @ 2006-07-28 20:30 UTC (permalink / raw)
  To: qemu-devel; +Cc: alan

> > With a proper async API, is there any reason why we would want this to be
> > tunable?  I don't think there's much of a benefit of prematurely claiming
> > a write is complete especially once the SCSI emulation can support
> > multiple simultaneous requests.
>
> You're right.  This O_SYNC bandaid should probably stay in place
> to prevent data corruption, until the AIO framework is ready to
> be used.

It's arguable whether O_SYNC is needed at all. Qemu doesn't claim data is 
written to disk, and provides facilities for the guest OS to flush the cache, 
just like real hardware does.

Have you measured the impact of O_SYNC? I wouldn't be surprised if it was 
significant.

Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] Re: [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-28 20:30     ` Paul Brook
@ 2006-07-28 20:43       ` Rik van Riel
  2006-07-28 21:01         ` Paul Brook
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2006-07-28 20:43 UTC (permalink / raw)
  To: Paul Brook; +Cc: alan, qemu-devel

Paul Brook wrote:
>>> With a proper async API, is there any reason why we would want this to be
>>> tunable?  I don't think there's much of a benefit of prematurely claiming
>>> a write is complete especially once the SCSI emulation can support
>>> multiple simultaneous requests.
>> You're right.  This O_SYNC bandaid should probably stay in place
>> to prevent data corruption, until the AIO framework is ready to
>> be used.
> 
> It's arguable whether O_SYNC is needed at all. Qemu doesn't claim data is 
> written to disk, and provides facilities for the guest OS to flush the cache, 
> just like real hardware does.

Nice.  Another difference between the qemu codebase and the qemu-dm
codebase used by Xen.

With the bdrv_flush stuff in place, it should even be easy for qemu
to actually do something when the guest OS switches disk write caching
off (currently that is a noop in the qemu code base).

> Have you measured the impact of O_SYNC? I wouldn't be surprised if it was 
> significant.

I suspect it'll be horrific in the qemu codebase (blocking execution
of the guest OS until disk IO is complete), but it's fine in the Xen
qemu-dm situation, where IO completion happens asynchronously.

The recent commit message on the Xen side did not suggest there was
that much of a difference between both qemu code bases.  Obviously
I was wrong, and the O_SYNC bandaid should probably be kept out for
now.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] Re: [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-28 20:43       ` Rik van Riel
@ 2006-07-28 21:01         ` Paul Brook
  0 siblings, 0 replies; 23+ messages in thread
From: Paul Brook @ 2006-07-28 21:01 UTC (permalink / raw)
  To: Rik van Riel; +Cc: alan, qemu-devel

> > Have you measured the impact of O_SYNC? I wouldn't be surprised if it was
> > significant.
>
> I suspect it'll be horrific in the qemu codebase (blocking execution
> of the guest OS until disk IO is complete), but it's fine in the Xen
> qemu-dm situation, where IO completion happens asynchronously.
>
> The recent commit message on the Xen side did not suggest there was
> that much of a difference between both qemu code bases.  Obviously
> I was wrong, and the O_SYNC bandaid should probably be kept out for
> now.

Ah, ok. I didn't realise they'd diverged that much either.

Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-28 19:54 [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk Rik van Riel
  2006-07-28 19:58 ` [Qemu-devel] " Rik van Riel
  2006-07-28 20:12 ` Anthony Liguori
@ 2006-07-29  9:57 ` Fabrice Bellard
  2006-07-29 14:59   ` Rik van Riel
  2 siblings, 1 reply; 23+ messages in thread
From: Fabrice Bellard @ 2006-07-29  9:57 UTC (permalink / raw)
  To: qemu-devel

Hi,

Using O_SYNC for disk image access is not acceptable: QEMU relies on the 
host OS to ensure that the data is written correctly. Even the current 
'fsync' support is questionable, to say the least!

Please don't mix issues regarding QEMU disk handling and the underlying 
hypervisor/host OS block device handling.

Regards,

Fabrice.

Rik van Riel wrote:
> This is the simple approach to making sure that disk writes actually
> hit disk before we tell the guest OS that IO has completed.  Thanks
> to DMA_MULTI_THREAD the performance still seems to be adequate.
> 
> A fancier solution would be to make the sync/non-sync behaviour of
> the qemu disk backing store tunable from the guest OS, by tuning
> the IDE disk write cache on/off with hdparm, and having hw/ide.c
> call ->fsync functions in the block backends.
> 
> I'm willing to code up the fancy solution if people prefer that.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-29  9:57 ` [Qemu-devel] " Fabrice Bellard
@ 2006-07-29 14:59   ` Rik van Riel
  2006-07-29 16:04     ` Paul Brook
                       ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Rik van Riel @ 2006-07-29 14:59 UTC (permalink / raw)
  To: qemu-devel

Fabrice Bellard wrote:
> Hi,
> 
> Using O_SYNC for disk image access is not acceptable: QEMU relies on the 
> host OS to ensure that the data is written correctly.

This means that write ordering is not preserved, and on a power
failure any data written by qemu (or Xen fully virt) guests may
not be preserved.

Applications running on the host can count on fsync doing the
right thing, meaning that if they call fsync, the data *will*
have made it to disk.  Applications running inside a guest have
no guarantees that their data is actually going to make it
anywhere when fsync returns...

This may look like hair splitting, but so far I've lost a
(test) postgresql database to this 3 times already.  Not getting
the guest application's data to disk when the application calls
fsync is a recipe for disaster.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-29 14:59   ` Rik van Riel
@ 2006-07-29 16:04     ` Paul Brook
  2006-07-29 16:22       ` Rik van Riel
  2006-07-29 17:33     ` Bill C. Riemers
                       ` (2 subsequent siblings)
  3 siblings, 1 reply; 23+ messages in thread
From: Paul Brook @ 2006-07-29 16:04 UTC (permalink / raw)
  To: qemu-devel

On Saturday 29 July 2006 15:59, Rik van Riel wrote:
> Fabrice Bellard wrote:
> > Hi,
> >
> > Using O_SYNC for disk image access is not acceptable: QEMU relies on the
> > host OS to ensure that the data is written correctly.
>
> This means that write ordering is not preserved, and on a power
> failure any data written by qemu (or Xen fully virt) guests may
> not be preserved.

I might be willing to accept this (or a similar) patch if you made it 
conditional on the guest having disabled write caching. I agree with Fabrice 
that the performance impact is too severe to consider turning it on by 
default. 

The same problem occurs with many hardware RAID controllers, and even many 
hard drives: fsync() only guarantees that the data has been passed to the 
controller (in this case the host OS). If you need absolute reliability you 
need either more flushing in your guest OS, a disabled write cache, or 
battery backup to make sure the IDE hardware (i.e. the host OS) doesn't die 
unexpectedly.

Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-29 16:04     ` Paul Brook
@ 2006-07-29 16:22       ` Rik van Riel
  2006-07-29 16:31         ` Paul Brook
  0 siblings, 1 reply; 23+ messages in thread
From: Rik van Riel @ 2006-07-29 16:22 UTC (permalink / raw)
  To: Paul Brook; +Cc: qemu-devel

Paul Brook wrote:
> On Saturday 29 July 2006 15:59, Rik van Riel wrote:
>> Fabrice Bellard wrote:
>>> Hi,
>>>
>>> Using O_SYNC for disk image access is not acceptable: QEMU relies on the
>>> host OS to ensure that the data is written correctly.
>> This means that write ordering is not preserved, and on a power
>> failure any data written by qemu (or Xen fully virt) guests may
>> not be preserved.
> 
> I might be willing to accept this (or a similar) patch if you made it 
> conditional on the guest having disabled write caching. I agree with Fabrice 
> that the performance impact is too severe to consider turning it on by 
> default. 

Easy to do with the fsync infrastructure, but probably not worth
doing since people are working on the AIO I/O backend, which would
allow multiple outstanding writes from a guest.  That, in turn,
means I/O completion in the guest can be done when the data really
hits disk, but without a performance impact.

-- 
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-29 16:22       ` Rik van Riel
@ 2006-07-29 16:31         ` Paul Brook
  2006-07-31  7:08           ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Paul Brook @ 2006-07-29 16:31 UTC (permalink / raw)
  To: Rik van Riel; +Cc: qemu-devel

> Easy to do with the fsync infrastructure, but probably not worth
> doing since people are working on the AIO I/O backend, which would
> allow multiple outstanding writes from a guest.  That, in turn,
> means I/O completion in the guest can be done when the data really
> hits disk, but without a performance impact.

Not entirely true. That only works if you allow multiple guest IO requests in 
parallel, ie. some form of tagged command queueing. This requires either 
improving the SCSI emulation, or implementing SATA emulation. AFAIK parallel 
IDE doesn't support command queueing.

My impression was that the initial AIO implementation is just straight serial 
async operation. I/O wouldn't actually go any faster; it just means the guest 
can do something else while it's waiting.

Paul

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-29 14:59   ` Rik van Riel
  2006-07-29 16:04     ` Paul Brook
@ 2006-07-29 17:33     ` Bill C. Riemers
  2006-07-30 21:47       ` Jamie Lokier
  2006-07-30 21:41     ` Jamie Lokier
  2006-07-31  7:08     ` Jens Axboe
  3 siblings, 1 reply; 23+ messages in thread
From: Bill C. Riemers @ 2006-07-29 17:33 UTC (permalink / raw)
  To: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1629 bytes --]

How about compromising, and making the patch a run-time option?  Presumably
this is only a problem when the virtual machine is not properly shut down.
Those who want the extra security of knowing the data will be written
regardless of the shutdown status can enable the flag.  By default it
could be turned off.  Then everybody can be happy.

Bill


On 7/29/06, Rik van Riel <riel@redhat.com> wrote:
>
> Fabrice Bellard wrote:
> > Hi,
> >
> > Using O_SYNC for disk image access is not acceptable: QEMU relies on the
> > host OS to ensure that the data is written correctly.
>
> This means that write ordering is not preserved, and on a power
> failure any data written by qemu (or Xen fully virt) guests may
> not be preserved.
>
> Applications running on the host can count on fsync doing the
> right thing, meaning that if they call fsync, the data *will*
> have made it to disk.  Applications running inside a guest have
> no guarantees that their data is actually going to make it
> anywhere when fsync returns...
>
> This may look like hair splitting, but so far I've lost a
> (test) postgresql database to this 3 times already.  Not getting
> the guest application's data to disk when the application calls
> fsync is a recipe for disaster.

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-29 14:59   ` Rik van Riel
  2006-07-29 16:04     ` Paul Brook
  2006-07-29 17:33     ` Bill C. Riemers
@ 2006-07-30 21:41     ` Jamie Lokier
  2006-07-31  9:52       ` andrzej zaborowski
  2006-07-31  7:08     ` Jens Axboe
  3 siblings, 1 reply; 23+ messages in thread
From: Jamie Lokier @ 2006-07-30 21:41 UTC (permalink / raw)
  To: qemu-devel

Rik van Riel wrote:
> This may look like hair splitting, but so far I've lost a
> (test) postgresql database to this 3 times already.  Not getting
> the guest application's data to disk when the application calls
> fsync is a recipe for disaster.

Exactly the same thing happens with real IDE disks if IDE write
caching (on the drive itself) is enabled, which it is by default.  It
is rarer, but it happens.

I've seen this with Linux 2.4 kernels writing to ext3 (real, not
virtual).  Filesystem metadata gets corrupted from time to time if
power is removed, because write ordering is not preserved.  Disabling
IDE write caching fixes it, but the performance impact is huge on some
systems.

Linux 2.6 kernels will issue IDE cache flush commands, at least with
ext3, to commit data to disk when fsync is called, and to preserve
journal/metadata ordering.

Doesn't qemu fsync the host file corresponding to the emulated disk,
when the guest OS issues an IDE cache flush?

For IDE emulation to be as reliable for data storage as a real disk,
it should:

    - fsync the host file whenever the guest OS issues an IDE cache
      flush command.

    - use O_SYNC (or fsync after each write or aio equivalent, etc.) _only_
      when the guest OS disables the IDE disk cache (not done by default).

-- Jamie
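
The two rules in the message above can be sketched roughly as below. This is a
minimal illustration under stated assumptions, not QEMU's block layer: struct
emu_disk and the emu_disk_*() functions are hypothetical names, and the sketch
calls fsync() after each write instead of toggling O_SYNC, because Linux does
not let fcntl(F_SETFL) change O_SYNC on an already open descriptor.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical emulated-drive state; the names are illustrative only. */
struct emu_disk {
    int fd;              /* host file backing the disk image */
    int write_cache_on;  /* guest-visible IDE write cache setting */
};

/* Open (or create, for the sake of the sketch) the backing file. The write
 * cache starts enabled, as it is by default on real IDE drives. */
int emu_disk_open(struct emu_disk *d, const char *path)
{
    assert(d != NULL && path != NULL);
    d->fd = open(path, O_RDWR | O_CREAT, 0600);
    d->write_cache_on = 1;
    return d->fd < 0 ? -1 : 0;
}

/* Guest write. With the cache enabled, completion is reported once the host
 * page cache holds the data; with it disabled, only after fsync() returns. */
int emu_disk_write(struct emu_disk *d, const void *buf, size_t len, off_t offset)
{
    const char *p = buf;

    assert(d != NULL && buf != NULL);
    while (len > 0) {
        ssize_t n = pwrite(d->fd, p, len, offset);
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
        offset += n;
    }
    if (!d->write_cache_on && fsync(d->fd) < 0)
        return -1;
    return 0;
}

/* Guest issued an IDE cache flush command: push everything to the host disk. */
int emu_disk_flush(struct emu_disk *d)
{
    assert(d != NULL);
    return fsync(d->fd);
}
```

With this split, a guest that leaves the cache enabled pays for durability
only when it asks for a flush, which is the behaviour the message above
attributes to a real drive.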

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-29 17:33     ` Bill C. Riemers
@ 2006-07-30 21:47       ` Jamie Lokier
  0 siblings, 0 replies; 23+ messages in thread
From: Jamie Lokier @ 2006-07-30 21:47 UTC (permalink / raw)
  To: qemu-devel

Bill C. Riemers wrote:
> How about compromising, and making the patch a run-time option?
> Presumably this is only a problem when the virtual machine is not
> properly shut down.  Those who want the extra security of knowing
> the data will be written regardless of the shutdown status can
> enable the flag.  By default it could be turned off.  Then everybody
> can be happy.

Real disks don't provide that security unless you disable the disk's
cache, or issue cache flush instructions to the disk.

Modern guest OS filesystems are written with this in mind.

With older guest OSes, you have to disable the disk cache if you want
that kind of security with real disks.

Is there any reason why the emulation should be any different?

-- Jamie

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] Re: [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-28 20:18   ` Rik van Riel
  2006-07-28 20:30     ` Paul Brook
@ 2006-07-31  7:08     ` Jens Axboe
  1 sibling, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2006-07-31  7:08 UTC (permalink / raw)
  To: qemu-devel

On Fri, Jul 28 2006, Rik van Riel wrote:
> Anthony Liguori wrote:
> 
> >Right now Fabrice is working on rewriting the block API to be
> >asynchronous.  There's been quite a lot of discussion about why using
> >threads isn't a good idea for this
> 
> Agreed, AIO is the way to go in the long run.
> 
> >With a proper async API, is there any reason why we would want this to be
> >tunable?  I don't think there's much of a benefit of prematurely claiming
> >a write is complete especially once the SCSI emulation can support
> >multiple simultaneous requests.
> 
> You're right.  This O_SYNC bandaid should probably stay in place
> to prevent data corruption, until the AIO framework is ready to
> be used.

O_SYNC is horrible, it'll totally kill performance. QEMU is basically
just a write cache enabled disk and it supports disk flushes as well. So
essentially it's the OS on top of QEMU that needs to take care of
flushing data out, like using barriers on the file system and
propagating fsync() properly down.

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-29 16:31         ` Paul Brook
@ 2006-07-31  7:08           ` Jens Axboe
  0 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2006-07-31  7:08 UTC (permalink / raw)
  To: qemu-devel

On Sat, Jul 29 2006, Paul Brook wrote:
> > Easy to do with the fsync infrastructure, but probably not worth
> > doing since people are working on the AIO I/O backend, which would
> > allow multiple outstanding writes from a guest.  That, in turn,
> > means I/O completion in the guest can be done when the data really
> > hits disk, but without a performance impact.
> 
> Not entirely true. That only works if you allow multiple guest IO
> requests in parallel, ie. some form of tagged command queueing. This
> requires either improving the SCSI emulation, or implementing SATA
> emulation. AFAIK parallel IDE doesn't support command queueing.

Parallel IDE does support queuing, but it never gained widespread
support and the standard is quite broken as well (which is probably
_why_ it never got much adoption). It was also quite suboptimal from a
CPU efficiency POV.

Besides, async completion in itself is not enough, QEMU still needs to
honor ordered writes (barriers) and cache flushes.

> My impression was that the initial AIO implementation is just
> straight serial async operation. I/O wouldn't actually go any faster;
> it just means the guest can do something else while it's waiting.

Depends on the app: if the I/O workload is parallel then you should see a
nice speedup as well (as QEMU is then no longer the serializing
bottleneck).

-- 
Jens Axboe

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-29 14:59   ` Rik van Riel
                       ` (2 preceding siblings ...)
  2006-07-30 21:41     ` Jamie Lokier
@ 2006-07-31  7:08     ` Jens Axboe
  2006-07-31  7:56       ` Jonas Maebe
  3 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2006-07-31  7:08 UTC (permalink / raw)
  To: qemu-devel

On Sat, Jul 29 2006, Rik van Riel wrote:
> Fabrice Bellard wrote:
> >Hi,
> >
> >Using O_SYNC for disk image access is not acceptable: QEMU relies on the 
> >host OS to ensure that the data is written correctly.
> 
> This means that write ordering is not preserved, and on a power
> failure any data written by qemu (or Xen fully virt) guests may
> not be preserved.
> 
> Applications running on the host can count on fsync doing the
> right thing, meaning that if they call fsync, the data *will*
> have made it to disk.  Applications running inside a guest have
> no guarantees that their data is actually going to make it
> anywhere when fsync returns...

Then the guest OS is broken. Applications issuing an fsync() should
issue a flush (or write-through), the guest OS should propagate this
knowledge through its I/O stack and the QEMU hard drive should get
notified. If the guest OS isn't doing what it's supposed to, QEMU can't
help you. And, in fact, running your app on the same host OS with write
back caching would screw you as well. The timing window will probably be
larger with QEMU, but the problem is essentially the same.

-- 
Jens Axboe


* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-31  7:08     ` Jens Axboe
@ 2006-07-31  7:56       ` Jonas Maebe
  2006-07-31  8:18         ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: Jonas Maebe @ 2006-07-31  7:56 UTC (permalink / raw)
  To: qemu-devel


On 31 jul 2006, at 09:08, Jens Axboe wrote:

>> Applications running on the host can count on fsync doing the
>> right thing, meaning that if they call fsync, the data *will*
>> have made it to disk.  Applications running inside a guest have
>> no guarantees that their data is actually going to make it
>> anywhere when fsync returns...
>
> Then the guest OS is broken.

The problem is that supposedly many OSes are broken in this way. See
http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html


Jonas


* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-31  7:56       ` Jonas Maebe
@ 2006-07-31  8:18         ` Jens Axboe
  0 siblings, 0 replies; 23+ messages in thread
From: Jens Axboe @ 2006-07-31  8:18 UTC (permalink / raw)
  To: qemu-devel

On Mon, Jul 31 2006, Jonas Maebe wrote:
> 
> On 31 jul 2006, at 09:08, Jens Axboe wrote:
> 
> >>Applications running on the host can count on fsync doing the
> >>right thing, meaning that if they call fsync, the data *will*
> >>have made it to disk.  Applications running inside a guest have
> >>no guarantees that their data is actually going to make it
> >>anywhere when fsync returns...
> >
> >Then the guest OS is broken.
> 
> The problem is that supposedly many OSes are broken in this way. See
> http://lists.apple.com/archives/darwin-dev/2005/Feb/msg00072.html

Well, as others have written here as well, then those OSes are broken on
"real" hardware as well.

I wouldn't be averse to a QEMU work-around, but O_SYNC is clearly not a
viable alternative! We could make QEMU behave more like a real hard
drive when it has aio support, "flushing" dirty cache out in a manner
more closely mimicking what a drive would do instead of relying on the
page cache writeout deciding to write it out.
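
One way to read the "more like a real hard drive" idea is to bound how
long dirty data may sit in the host page cache. The sketch below is only
an illustration under that assumption: the structure, the function names
and the age limit are all hypothetical and are not taken from QEMU. The
caller supplies the current time in milliseconds.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>

/* Hypothetical bookkeeping for one emulated drive; not a QEMU type. */
struct sketch_wcache {
    int fd;                   /* host file backing the emulated disk */
    bool dirty;               /* writes completed since the last fsync() */
    int64_t oldest_dirty_ms;  /* time of the first unflushed write */
    int64_t max_age_ms;       /* assumed limit on how long data may linger */
};

/* Call after every completed guest write. */
static void sketch_note_write(struct sketch_wcache *c, int64_t now_ms)
{
    assert(c != NULL);
    if (!c->dirty) {
        c->dirty = true;
        c->oldest_dirty_ms = now_ms;
    }
}

/*
 * Call periodically, e.g. from a timer.
 * Returns 1 if a flush was done, 0 if none was needed, -1 if fsync() failed.
 */
static int sketch_tick(struct sketch_wcache *c, int64_t now_ms)
{
    assert(c != NULL);
    if (!c->dirty || now_ms - c->oldest_dirty_ms < c->max_age_ms)
        return 0;
    if (fsync(c->fd) < 0)
        return -1;            /* stay dirty so the next tick retries */
    c->dirty = false;
    return 1;
}
```

This only narrows the window in which data can be lost; it does not
replace honoring explicit cache flushes from the guest.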

-- 
Jens Axboe


* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-30 21:41     ` Jamie Lokier
@ 2006-07-31  9:52       ` andrzej zaborowski
  2006-07-31 10:17         ` Jens Axboe
  0 siblings, 1 reply; 23+ messages in thread
From: andrzej zaborowski @ 2006-07-31  9:52 UTC (permalink / raw)
  To: qemu-devel

On 30/07/06, Jamie Lokier <jamie@shareable.org> wrote:
> Rik van Riel wrote:
> > This may look like hair splitting, but so far I've lost a
> > (test) postgresql database to this 3 times already.  Not getting
> > the guest application's data to disk when the application calls
> > fsync is a recipe for disaster.
>
> Exactly the same thing happens with real IDE disks if IDE write
> caching (on the drive itself) is enabled, which it is by default.  It
> is rarer, but it happens.

The little difference with QEMU is that there are two caches above it:
the host OS's software cache and the IDE hardware cache. When a guest
OS flushes its own software cache, its precious data goes to the host's
software cache while the guest thinks it's already in the IDE cache. This
is of course of less importance because data in both caches (hard- and
software) is lost when the power is cut off.

IMHO what really makes IO unreliable in QEMU is that IO errors on the
host are not reported to the guest by the IDE emulation and there's an
exact place in hw/ide.c where they are arrogantly ignored.
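
As a rough illustration of what reporting host errors could look like,
the sketch below writes a buffer to the host file and turns the result
into IDE status and error register values. The bit values are the
standard ATA ones; the structure and function names are hypothetical and
this is not the code in hw/ide.c. Using ABRT for a failed host write is
an assumption, since a real drive might report a different error bit.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>
#include <unistd.h>

/* ATA status register bits. */
#define SK_STAT_ERR   0x01  /* an error occurred, see the error register */
#define SK_STAT_DSC   0x10  /* seek complete */
#define SK_STAT_DRDY  0x40  /* drive ready */
/* ATA error register bit. */
#define SK_ERR_ABRT   0x04  /* command aborted */

/* Hypothetical register file for the sketch; not a QEMU type. */
struct sketch_ide_regs {
    uint8_t status;
    uint8_t error;
};

/* Write the whole buffer; returns 0 on success or -errno on failure. */
static int sketch_write_all(int fd, const void *buf, size_t len, off_t off)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(fd, p, len, off);
        if (n < 0) {
            if (errno == EINTR)
                continue;     /* interrupted, retry the same chunk */
            return -errno;
        }
        if (n == 0)
            return -EIO;      /* no progress; treat as an I/O error */
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/* Reflect the host result in the registers the guest will read. */
static void sketch_complete_write(struct sketch_ide_regs *r, int ret)
{
    assert(r != NULL);
    if (ret < 0) {
        r->status = SK_STAT_DRDY | SK_STAT_ERR;
        r->error = SK_ERR_ABRT;
    } else {
        r->status = SK_STAT_DRDY | SK_STAT_DSC;
        r->error = 0;
    }
}
```

The point is only that the return value of the host write reaches the
guest-visible status instead of being dropped.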

>
> I've seen this with Linux 2.4 kernels writing to ext3 (real, not
> virtual).  Filesystem metadata gets corrupted from time to time if
> power is removed, because write ordering is not preserved.  Disabling
> IDE write caching fixes it, but the performance impact is huge on some
> systems.
>
> Linux 2.6 kernels will issue IDE cache flush commands, at least with
> ext3, to commit data to disk when fsync is called, and to preserve
> journal/metadata ordering.
>
> Doesn't qemu fsync the host file corresponding to the emulated disk,
> when the guest OS issues an IDE cache flush?
>
> For IDE emulation to be as reliable for data storage as a real disk,
> it should:
>
>     - fsync the host file whenever the guest OS issues an IDE cache
>       flush command.
>
>     - use O_SYNC (or fsync after each write or aio equivalent, etc.) _only_
>       when the guest OS disables the IDE disk cache (not done by default).
>
> -- JAmie
>
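
The two rules in Jamie's list above can be sketched roughly as follows.
This is a minimal illustration, not QEMU code: the structure and the
function names are hypothetical, and it uses fsync() after each write
(one of the options Jamie mentions) rather than reopening the file with
O_SYNC when the guest turns the write cache off.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical state for one emulated disk; not a QEMU type. */
struct sketch_disk {
    int fd;                /* host file backing the emulated disk */
    bool write_cache_on;   /* what the guest last requested; on by default */
};

/* Open the backing file; returns 0 on success, -1 on failure. */
static int sketch_disk_open(struct sketch_disk *d, const char *path)
{
    assert(d != NULL && path != NULL);
    d->fd = open(path, O_RDWR | O_CREAT, 0600);
    d->write_cache_on = true;
    return d->fd < 0 ? -1 : 0;
}

/* Guest write; returns 0 on success, -1 on failure. */
static int sketch_disk_write(struct sketch_disk *d, const void *buf,
                             size_t len, off_t offset)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = pwrite(d->fd, p, len, offset);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            return -1;
        p += n;
        len -= (size_t)n;
        offset += n;
    }
    /* Write cache disabled by the guest: data must be stable before
     * the write is reported as complete. */
    if (!d->write_cache_on && fsync(d->fd) < 0)
        return -1;
    return 0;
}

/* Guest issued an IDE cache flush command. */
static int sketch_disk_flush(struct sketch_disk *d)
{
    return fsync(d->fd) < 0 ? -1 : 0;
}
```

With the cache left on, writes stay cheap and durability is paid for
only when the guest asks for a flush, which matches what a real drive
with write caching enabled would do.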


-- 
balrog 2oo6

Dear Outlook users: Please remove me from your address books
http://www.newsforge.com/article.pl?sid=03/08/21/143258


* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-31  9:52       ` andrzej zaborowski
@ 2006-07-31 10:17         ` Jens Axboe
  2006-07-31 17:50           ` andrzej zaborowski
  0 siblings, 1 reply; 23+ messages in thread
From: Jens Axboe @ 2006-07-31 10:17 UTC (permalink / raw)
  To: balrogg, qemu-devel

On Mon, Jul 31 2006, andrzej zaborowski wrote:
> On 30/07/06, Jamie Lokier <jamie@shareable.org> wrote:
> >Rik van Riel wrote:
> >> This may look like hair splitting, but so far I've lost a
> >> (test) postgresql database to this 3 times already.  Not getting
> >> the guest application's data to disk when the application calls
> >> fsync is a recipe for disaster.
> >
> >Exactly the same thing happens with real IDE disks if IDE write
> >caching (on the drive itself) is enabled, which it is by default.  It
> >is rarer, but it happens.
> 
> The little difference with QEMU is that there are two caches above it:
> the host OS's software cache and the IDE hardware cache. When a guest
> OS flushes its own software cache, its precious data goes to the host's
> software cache while the guest thinks it's already in the IDE cache. This
> is of course of less importance because data in both caches (hard- and
> software) is lost when the power is cut off.

But the drive cache does not let the dirty data linger for as long as
the OS page/buffer cache.

> IMHO what really makes IO unreliable in QEMU is that IO errors on the
> host are not reported to the guest by the IDE emulation and there's an
> exact place in hw/ide.c where they are arrogantly ignored.

Send a patch, I'm pretty sure nobody would disagree :-)

-- 
Jens Axboe


* Re: [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk
  2006-07-31 10:17         ` Jens Axboe
@ 2006-07-31 17:50           ` andrzej zaborowski
  0 siblings, 0 replies; 23+ messages in thread
From: andrzej zaborowski @ 2006-07-31 17:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: qemu-devel

On 31/07/06, Jens Axboe <qemu@kernel.dk> wrote:
> On Mon, Jul 31 2006, andrzej zaborowski wrote:
> > On 30/07/06, Jamie Lokier <jamie@shareable.org> wrote:
> > >Rik van Riel wrote:
> > >> This may look like hair splitting, but so far I've lost a
> > >> (test) postgresql database to this 3 times already.  Not getting
> > >> the guest application's data to disk when the application calls
> > >> fsync is a recipe for disaster.
> > >
> > >Exactly the same thing happens with real IDE disks if IDE write
> > >caching (on the drive itself) is enabled, which it is by default.  It
> > >is rarer, but it happens.
> >
> > The little difference with QEMU is that there are two caches above it:
> > the host OS's software cache and the IDE hardware cache. When a guest
> > OS flushes its own software cache, its precious data goes to the host's
> > software cache while the guest thinks it's already in the IDE cache. This
> > is of course of less importance because data in both caches (hard- and
> > software) is lost when the power is cut off.
>
> But the drive cache does not let the dirty data linger for as long as
> the OS page/buffer cache.

I would say this is an argument for actually using O_SYNC.

>
> > IMHO what really makes IO unreliable in QEMU is that IO errors on the
> > host are not reported to the guest by the IDE emulation and there's an
> > exact place in hw/ide.c where they are arrogantly ignored.
>
> Send a patch, I'm pretty sure nobody would disagree :-)

Here's what I proposed:
http://lists.gnu.org/archive/html/qemu-devel/2005-12/msg00275.html but
I'm afraid it's not correct :P

>
> --
> Jens Axboe
>
>


-- 
balrog 2oo6

Dear Outlook users: Please remove me from your address books
http://www.newsforge.com/article.pl?sid=03/08/21/143258


end of thread, other threads:[~2006-07-31 17:50 UTC | newest]

Thread overview: 23+ messages
2006-07-28 19:54 [Qemu-devel] [RFC][PATCH] make sure disk writes actually hit disk Rik van Riel
2006-07-28 19:58 ` [Qemu-devel] " Rik van Riel
2006-07-28 20:12 ` Anthony Liguori
2006-07-28 20:18   ` Rik van Riel
2006-07-28 20:30     ` Paul Brook
2006-07-28 20:43       ` Rik van Riel
2006-07-28 21:01         ` Paul Brook
2006-07-31  7:08     ` Jens Axboe
2006-07-29  9:57 ` [Qemu-devel] " Fabrice Bellard
2006-07-29 14:59   ` Rik van Riel
2006-07-29 16:04     ` Paul Brook
2006-07-29 16:22       ` Rik van Riel
2006-07-29 16:31         ` Paul Brook
2006-07-31  7:08           ` Jens Axboe
2006-07-29 17:33     ` Bill C. Riemers
2006-07-30 21:47       ` Jamie Lokier
2006-07-30 21:41     ` Jamie Lokier
2006-07-31  9:52       ` andrzej zaborowski
2006-07-31 10:17         ` Jens Axboe
2006-07-31 17:50           ` andrzej zaborowski
2006-07-31  7:08     ` Jens Axboe
2006-07-31  7:56       ` Jonas Maebe
2006-07-31  8:18         ` Jens Axboe
