From: Jens Axboe <qemu@kernel.dk>
To: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Ensuring data is written to disk
Date: Tue, 1 Aug 2006 21:05:05 +0200
Message-ID: <20060801190505.GA20108@suse.de>
In-Reply-To: <20060801141705.GA7779@mail.shareable.org>

On Tue, Aug 01 2006, Jamie Lokier wrote:
> Jens Axboe wrote:
> > On Tue, Aug 01 2006, Jamie Lokier wrote:
> > > > Of course, guessing the disk drive write buffer size and trying not to kill
> > > > system I/O performance with all these writes is another question entirely
> > > > ... sigh !!!
> > > 
> > > If you just want to evict all data from the drive's cache, and don't
> > > actually have other data to write, there is a CACHEFLUSH command you
> > > can send to the drive which will be more dependable than writing as
> > > much data as the cache size.
> > 
> > Exactly, and this is what the OS fsync() should do once the drive has
> > acknowledged that the data has been written (to cache). At least
> > reiserfs w/barriers on Linux does this.
> 
> 1. Are you sure this happens, w/ reiserfs on Linux, even if the disk
>    is an SATA or SCSI type that supports ordered tagged commands?  My
>    understanding is that barriers force an ordering between write
>    commands, and that CACHEFLUSH is used only with disks that don't have
>    more sophisticated write ordering commands.  Is the data still
>    committed to the disk platter before fsync() returns on those?

No SATA drive supports ordered tags; that is a SCSI-only property.
Barrier writes are a separate thing. Reiser probably ties the two
together because it needs to know whether the flush cache command works
as expected. Drives are funny sometimes...

For SATA you always need at least one cache flush (you need one if you
have the FUA/Forced Unit Access write available, you need two if not).
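To make the flush counts above concrete, here is a minimal sketch in C of the command sequence for a single barrier write. Only the counts (one flush with FUA, two without) come from the text above. The ordering shown (a flush before the barrier write, and either a FUA write or a trailing flush after it) is an assumption about how a block layer might arrange it, and the names disk_cmd and barrier_sequence are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical command names, for illustration only. */
enum disk_cmd { CMD_FLUSH_CACHE, CMD_WRITE, CMD_WRITE_FUA };

/*
 * Fill out[] with the commands issued for one barrier write and return
 * how many there are. Assumed ordering, not taken from any real driver.
 */
size_t barrier_sequence(bool has_fua, enum disk_cmd out[3])
{
    size_t n = 0;

    out[n++] = CMD_FLUSH_CACHE;     /* push earlier cached writes to media */
    if (has_fua) {
        out[n++] = CMD_WRITE_FUA;   /* barrier write bypasses the cache */
    } else {
        out[n++] = CMD_WRITE;       /* barrier write lands in the cache... */
        out[n++] = CMD_FLUSH_CACHE; /* ...so a second flush is required */
    }
    return n;
}
```

With has_fua set, the sequence contains one flush; without it, two.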

> 2. Do you know if ext3 (in ordered mode) w/barriers on Linux does it too,
>    for in-place writes which don't modify the inode and therefore don't
>    have a journal entry?

I don't think that it does, however it may have changed. A quick grep
would seem to indicate that it has not changed.

> On Darwin, fsync() does not issue CACHEFLUSH to the drive.  Instead,
> it has an fcntl F_FULLFSYNC which does that, which is documented in
> Darwin's fsync() page as working with all Darwin's filesystems,
> provided the hardware honours CACHEFLUSH or the equivalent.

That seems somewhat strange to me, I'd much rather be able to say that
fsync() itself is safe. An added fcntl hack doesn't really help the
applications that already rely on the correct behaviour.
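For an application that has to run on both systems today, one way to paper over the difference is a small wrapper. The following is only a sketch: durable_sync is a hypothetical name, and the fallback to plain fsync() when the fcntl fails is an assumption about what a caller would want, not something Darwin mandates. On Linux it gives no more of a guarantee than fsync() itself does.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: push a file's data as far toward stable storage
 * as the platform offers. Returns 0 on success, -1 on error (errno set).
 */
int durable_sync(int fd)
{
#ifdef F_FULLFSYNC
    /* Darwin: also ask the drive to flush its write cache. */
    if (fcntl(fd, F_FULLFSYNC) != -1)
        return 0;
    /* The fcntl can fail on some filesystems; fall back to fsync(). */
#endif
    return fsync(fd);
}
```

A caller would use durable_sync(fd) wherever it currently calls fsync(fd) and check the return value the same way.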

> From what little documentation I've found, on Linux it appears to be
> much less predictable.  It seems that some filesystems, with some
> kernel versions, and some mount options, on some types of disk, with
> some drive settings, will commit data to a platter before fsync()
> returns, and others won't.  And an application calling fsync() has no
> easy way to find out.  Have I got this wrong?

Nope, I'm afraid that is pretty much true... reiser and (it looks like,
just grepped) XFS have the best support for this. Unfortunately I don't think
the user can actually tell if the OS does the right thing, outside of
running a blktrace and verifying that it actually sends a flush cache
down the queue.

> ps. (An aside question): do you happen to know of a good patch which
> implements IDE barriers w/ ext3 on 2.4 kernels?  I found a patch by
> googling, but it seemed that the ext3 parts might not be finished, so
> I don't trust it.  I've found turning off the IDE write cache makes
> writes safe, but with a huge performance cost.

The hard part (the IDE code) can be grabbed from the SLES8 latest
kernels, I developed and tested the code there. That also has the ext3
bits, IIRC.

-- 
Jens Axboe


Thread overview: 13+ messages
2006-08-01  0:11 [Qemu-devel] Ensuring data is written to disk Armistead, Jason
2006-08-01 10:17 ` Jamie Lokier
2006-08-01 10:45   ` Jens Axboe
2006-08-01 14:17     ` Jamie Lokier
2006-08-01 19:05       ` Jens Axboe [this message]
2006-08-01 21:50         ` Jamie Lokier
2006-08-02  6:51           ` Jens Axboe
2006-08-02 13:28             ` Jamie Lokier
2006-08-02 15:56               ` Bill C. Riemers
2006-08-07 13:11             ` R. Armiento
2006-08-07 16:14               ` Bill C. Riemers
2006-08-07 18:13               ` Thomas Steffen
2006-08-08  2:37                 ` R. Armiento
