From: Christoph Hellwig <hch@lst.de>
To: Anthony Liguori <anthony@codemonkey.ws>
Cc: Kevin Wolf <kwolf@redhat.com>, Christoph Hellwig <hch@lst.de>,
	qemu-devel <qemu-devel@nongnu.org>
Subject: [Qemu-devel] Re: Caching modes
Date: Tue, 21 Sep 2010 16:26:08 +0200
Message-ID: <20100921142608.GA18290@lst.de>
In-Reply-To: <4C97F9C6.60501@codemonkey.ws>

On Mon, Sep 20, 2010 at 07:18:14PM -0500, Anthony Liguori wrote:
> O_DIRECT alone to a pre-allocated file on a normal file system should 
> result in the data being visible without any additional metadata 
> transactions.

Anthony, for the third time: no.  O_DIRECT is a non-portable extension
in Linux (taken from IRIX) and is defined as:


       O_DIRECT (Since Linux 2.4.10)
              Try  to minimize cache effects of the I/O to and from this file.
              In general this will degrade performance, but it  is  useful  in
              special  situations,  such  as  when  applications  do their own
              caching.  File I/O is done directly to/from user space  buffers.
              The O_DIRECT flag on its own makes at an effort to transfer data
              synchronously, but does not give the guarantees  of  the  O_SYNC
              that  data and necessary metadata are transferred.  To guarantee
              synchronous I/O the O_SYNC must be used in addition to O_DIRECT.
              See NOTES below for further discussion.

              A  semantically  similar  (but  deprecated)  interface for block
              devices is described in raw(8).

O_DIRECT does not have any meaning for data integrity; it just tells the
filesystem that it *should* not use the pagecache.  Even then, various
filesystems fall back to buffered I/O for corner cases.  It does *not*
mean the actual disk cache gets flushed, and it does *not* guarantee
anything about metadata, which is very important.

In practice, metadata updates happen when filling a sparse file, when
extending the file size, when using a COW filesystem, and when
converting preallocated extents to fully allocated ones.  They could
happen in many more cases depending on the filesystem implementation.
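
To make the file-size case concrete, here is a minimal Python sketch.
It is not QEMU code, and the helper name size_changes is made up for
this illustration.  It only shows the one metadata change that is
trivially visible from userspace (st_size).  The other cases (holes,
COW, preallocated extents) depend on the filesystem and cannot be
observed this simply; an in-place overwrite may still touch metadata
such as timestamps or block mappings.

```python
import os

def size_changes(path, offset, data):
    """Hypothetical helper: pwrite `data` at `offset` and return
    (st_size before, st_size after) for the file."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        before = os.fstat(fd).st_size
        os.pwrite(fd, data, offset)
        after = os.fstat(fd).st_size
        return before, after
    finally:
        os.close(fd)
```

A write inside the existing size leaves st_size alone, while a write
past the end changes it, and that change is metadata the filesystem has
to make stable in addition to the data itself.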

> >Barriers are a Linux-specific implementation details that is in the
> >process of going away, probably in Linux 2.6.37.  But if you want
> >O_DSYNC semantics with a volatile disk write cache there is no way
> >around using a cache flush or the FUA bit on all I/O caused by it.
> 
> If you have a volatile disk write cache, then we don't need O_DSYNC 
> semantics.

If you present a volatile write cache to the guest you indeed do not
need O_DSYNC and can rely on the guest sending fdatasync calls when it
wants to flush the cache.  But for the statement above you can replace
O_DSYNC with fdatasync and it will still be correct.  O_DSYNC in current
Linux kernels is nothing but an implicit range fdatasync after each
write.
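
As a rough illustration of that equivalence, the two Python functions
below should give the same durability for the written data.  This is a
sketch only: the function names are invented here, and the explicit
version syncs the whole file rather than just the written range.

```python
import os

# fdatasync is not available on every platform; fall back to fsync there.
_datasync = getattr(os, "fdatasync", os.fsync)

def write_odsync(path, data):
    """Write via a descriptor opened with O_DSYNC: each write carries
    an implicit data sync before it returns."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | os.O_DSYNC
    fd = os.open(path, flags, 0o600)
    try:
        return os.write(fd, data)
    finally:
        os.close(fd)

def write_then_fdatasync(path, data):
    """Plain write followed by an explicit fdatasync."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        written = os.write(fd, data)
        _datasync(fd)
        return written
    finally:
        os.close(fd)
```

Either way a cache flush (or FUA) has to reach a disk with a volatile
write cache before the write counts as stable; the flag only changes
who asks for it.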

> >   We
> >currently use the cache flush, and although I plan to experiment a bit
> >more with the FUA bit for O_DIRECT | O_DSYNC writes I would be very
> >surprised if they actually are any faster.
> >   
> 
> The thing I struggle with understanding is that if the guest is sending 
> us a write request, why are we sending the underlying disk a write + 
> flush request?  That doesn't seem logical at all to me.

We send a cache flush request *iff* we present the guest a device
without a volatile write cache, so that it can assume all writes are
stable, and we sit on a device that does have a volatile write cache.
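
Spelled out as a predicate (a sketch with invented names, not the
actual QEMU logic), the rule is:

```python
def needs_host_flush(guest_sees_write_cache: bool,
                     host_has_write_cache: bool) -> bool:
    """Hypothetical predicate: must the host follow a guest write with
    a cache flush of its own?

    Only when the guest was told there is no volatile cache (so it will
    never flush itself) while the host device actually has one.
    """
    return (not guest_sees_write_cache) and host_has_write_cache

if __name__ == "__main__":
    # Print the full truth table for the two inputs.
    for guest in (False, True):
        for host in (False, True):
            print(guest, host, needs_host_flush(guest, host))
```

In the other three combinations the host adds no flushes of its own:
either the guest issues them when it wants stable data, or there is no
volatile cache underneath to flush.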

> Even if we advertise WC disable, it should be up to the guest to decide 
> when to issue flushes.

No.  If we don't claim to have a volatile cache no guest will ever flush
the cache.  Which is only logical, given that we just told it that we
don't have a cache that needs flushing.

> >ext3 and ext4 have really bad fsync implementations.  Just use a better
> >filesystem or bug one of it's developers if you want that fixed.  But
> >except for disabling the disk cache there is no way to get data integrity
> >without cache flushes (the FUA bit is nothing but an implicit flush).
> >   
> 
> But why are we issuing more flushes than the guest is issuing if we 
> don't have to worry about filesystem metadata (i.e. preallocated storage 
> or physical devices)?

Who is "we" and what is the workload/filesystem/kernel combination?
Specific details and numbers please.
