Date: Tue, 21 Sep 2010 16:26:08 +0200
From: Christoph Hellwig
To: Anthony Liguori
Cc: Kevin Wolf, Christoph Hellwig, qemu-devel
Subject: [Qemu-devel] Re: Caching modes
Message-ID: <20100921142608.GA18290@lst.de>
In-Reply-To: <4C97F9C6.60501@codemonkey.ws>

On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
> O_DIRECT alone to a pre-allocated file on a normal file system should
> result in the data being visible without any additional metadata
> transactions.

Anthony, for the third time: no.  O_DIRECT is a non-portable extension
in Linux (taken from IRIX) and is defined as:

       O_DIRECT (Since Linux 2.4.10)
              Try to minimize cache effects of the I/O to and from this
              file.  In general this will degrade performance, but it
              is useful in special situations, such as when
              applications do their own caching.  File I/O is done
              directly to/from user space buffers.  The O_DIRECT flag
              on its own makes an effort to transfer data
              synchronously, but does not give the guarantees of the
              O_SYNC flag that data and necessary metadata are
              transferred.
              To guarantee synchronous I/O, O_SYNC must be used in
              addition to O_DIRECT.  See NOTES below for further
              discussion.

              A semantically similar (but deprecated) interface for
              block devices is described in raw(8).

O_DIRECT does not have any meaning for data integrity; it just tells
the filesystem that it *should* not use the pagecache.  Even then,
various filesystems fall back to buffered I/O for corner cases.  It
does *not* mean the actual disk cache gets flushed, and it does *not*
guarantee anything about metadata, which is very important.  In
practice, metadata updates happen when filling sparse files, when
extending the file size, when using a COW filesystem, and when
converting preallocated extents to fully allocated ones, and they
could happen in many more cases depending on the filesystem
implementation.

> >Barriers are a Linux-specific implementation detail that is in the
> >process of going away, probably in Linux 2.6.37.  But if you want
> >O_DSYNC semantics with a volatile disk write cache there is no way
> >around using a cache flush or the FUA bit on all I/O caused by it.
>
> If you have a volatile disk write cache, then we don't need O_DSYNC
> semantics.

If you present a volatile write cache to the guest, you indeed do not
need O_DSYNC and can rely on the guest sending fdatasync calls when it
wants to flush the cache.  But for the statement above you can replace
O_DSYNC with fdatasync and it will still be correct: O_DSYNC in
current Linux kernels is nothing but an implicit range fdatasync after
each write.

> > We
> >currently use the cache flush, and although I plan to experiment a bit
> >more with the FUA bit for O_DIRECT | O_DSYNC writes I would be very
> >surprised if they actually are any faster.
>
> The thing I struggle with understanding is that if the guest is sending
> us a write request, why are we sending the underlying disk a write +
> flush request?  That doesn't seem logical at all to me.
We only send a cache flush request *iff* we present the guest a device
without a volatile write cache, so that it can assume all writes are
stable, and we sit on a host device that does have a volatile write
cache.

> Even if we advertise WC disable, it should be up to the guest to decide
> when to issue flushes.

No.  If we don't claim to have a volatile cache, no guest will ever
flush the cache.  Which is only logical, given that we just told it
that we don't have a cache that needs flushing.

> >ext3 and ext4 have really bad fsync implementations.  Just use a better
> >filesystem or bug one of its developers if you want that fixed.  But
> >except for disabling the disk cache there is no way to get data integrity
> >without cache flushes (the FUA bit is nothing but an implicit flush).
>
> But why are we issuing more flushes than the guest is issuing if we
> don't have to worry about filesystem metadata (i.e. preallocated storage
> or physical devices)?

Who is "we", and what is the workload/filesystem/kernel combination?
Specific details and numbers, please.