Date: Tue, 1 Aug 2006 21:05:05 +0200
From: Jens Axboe
Subject: Re: [Qemu-devel] Ensuring data is written to disk
Message-ID: <20060801190505.GA20108@suse.de>
References: <20060801101743.GA31760@mail.shareable.org> <20060801104539.GO31908@suse.de> <20060801141705.GA7779@mail.shareable.org>
In-Reply-To: <20060801141705.GA7779@mail.shareable.org>
To: qemu-devel@nongnu.org
List-Id: qemu-devel.nongnu.org

On Tue, Aug 01 2006, Jamie Lokier wrote:
> Jens Axboe wrote:
> > On Tue, Aug 01 2006, Jamie Lokier wrote:
> > > > Of course, guessing the disk drive write buffer size and trying
> > > > not to kill system I/O performance with all these writes is
> > > > another question entirely ... sigh !!!
> > >
> > > If you just want to evict all data from the drive's cache, and don't
> > > actually have other data to write, there is a CACHEFLUSH command you
> > > can send to the drive which will be more dependable than writing as
> > > much data as the cache size.
> >
> > Exactly, and this is what the OS fsync() should do once the drive has
> > acknowledged that the data has been written (to cache). At least
> > reiserfs w/barriers on Linux does this.
>
> 1. Are you sure this happens, w/ reiserfs on Linux, even if the disk
> is an SATA or SCSI type that supports ordered tagged commands? My
> understanding is that barriers force an ordering between write
> commands, and that CACHEFLUSH is used only with disks that don't have
> more sophisticated write ordering commands. Is the data still
> committed to the disk platter before fsync() returns on those?

No SATA drive supports ordered tags, that is a SCSI-only property.
Barrier writes are a separate thing; reiserfs probably ties the two
together because it needs to know whether the flush cache command works
as expected. Drives are funny sometimes... For SATA you always need at
least one cache flush (one if you have the FUA/Forced Unit Access write
available, two if not).

> 2. Do you know if ext3 (in ordered mode) w/barriers on Linux does it too,
> for in-place writes which don't modify the inode and therefore don't
> have a journal entry?

I don't think that it does, though it may have changed. A quick grep
would seem to indicate that it has not.

> On Darwin, fsync() does not issue CACHEFLUSH to the drive. Instead,
> it has an fcntl F_FULLFSYNC which does that, which is documented in
> Darwin's fsync() page as working with all Darwin's filesystems,
> provided the hardware honours CACHEFLUSH or the equivalent.

That seems somewhat strange to me, I'd much rather be able to say that
fsync() itself is safe.
An added fcntl hack doesn't really help the applications that already
rely on the correct behaviour.

> From what little documentation I've found, on Linux it appears to be
> much less predictable. It seems that some filesystems, with some
> kernel versions, and some mount options, on some types of disk, with
> some drive settings, will commit data to a platter before fsync()
> returns, and others won't. And an application calling fsync() has no
> easy way to find out. Have I got this wrong?

Nope, I'm afraid that is pretty much true... reiserfs and (it looks
like, I just grepped) XFS have the best support for this. Unfortunately
I don't think the user can actually tell whether the OS does the right
thing, outside of running a blktrace and verifying that it actually
sends a flush cache down the queue.

> ps. (An aside question): do you happen to know of a good patch which
> implements IDE barriers w/ ext3 on 2.4 kernels? I found a patch by
> googling, but it seemed that the ext3 parts might not be finished, so
> I don't trust it. I've found turning off the IDE write cache makes
> writes safe, but with a huge performance cost.

The hard part (the IDE code) can be grabbed from the latest SLES8
kernels; I developed and tested the code there. That also has the ext3
bits, IIRC.

-- 
Jens Axboe