Re: [Qemu-devel] Ensuring data is written to disk

From: "Bill C. Riemers" <docbill@freeshell.org>
To: qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] Ensuring data is written to disk
Date: Wed, 2 Aug 2006 11:56:16 -0400
Message-ID: <23bcb8700608020856v5af79ae4r23b4a62035cee040@mail.gmail.com>
In-Reply-To: <20060802132849.GA13904@mail.shareable.org>

Just to throw in my two cents: the Namesys website claims that reiser4 is
completely safe in the event of a power failure, while reiserfs 3 still
requires some recovery.  Apparently reiser4 orders its writes so that each
change is atomic: either the whole change is on disk, or none of it is.  I
am not sure how this is accomplished given the state of disk caching...
Perhaps that is why they don't consider reiser4 ready for prime-time use.

Bill

On 8/2/06, Jamie Lokier <jamie@shareable.org> wrote:
>
> Jens Axboe wrote:
> > > > For SATA you always need at least one cache flush (you need one if you
> > > > have the FUA/Forced Unit Access write available, you need two if not).
> > >
> > > Well my question wasn't intended to be specific to ATA (sorry if that
> > > wasn't clear), but a general question about writing to disks on Linux.
> > >
> > > And I don't understand your answer.  Are you saying that reiserfs on
> > > Linux (presumably 2.6) commits data (and file metadata) to disk
> > > platters before returning from fsync(), for all types of disk
> > > including PATA, SATA and SCSI?  Or if not, is that a known property of
> > > PATA only, or PATA and SATA only?  (And in all cases, presumably only
> > > "ordinary" controllers can be depended on, not RAID controllers or
> > > USB/Firewire bridges which ignore cache flushes for no good reason).
> >
> > blkdev_issue_flush() is brutal, but it works on SATA/PATA/SCSI. So yes,
> > it should be reliable.
>
> Ah, thanks.  I've looked at that bit of reiserfs, xfs and ext3 now.
>
> It looks like adding a single call to blkdev_issue_flush() at the end
> of ext3_sync_file() would do the trick.  I'm surprised that one-line
> patch isn't in there already.
>
> Of course that doesn't help with writing an application to reliably
> commit on existing systems.
>
> > > > > 2. Do you know if ext3 (in ordered mode) w/barriers on Linux does it too,
> > > > >    for in-place writes which don't modify the inode and therefore don't
> > > > >    have a journal entry?
> > > >
> > > > I don't think that it does, however it may have changed. A quick grep
> > > > would seem to indicate that it has not changed.
> > >
> > > Ew.  What do databases do to be reliable then?  Or aren't they, on Linux?
> >
> > They probably run on better storage than commodity SATA drives with
> > write back caching enabled. To my knowledge, Linux is one of the only OSes
> > that even attempts to fix this.
>
> I would imagine most of the MySQL databases backing small web sites
> run on commodity PATA or SATA drives, and that most people have
> assumed fsync() to be good enough for database commits in the absence
> of hardware failure, or when one disk goes down in a RAID.  Time to
> correct those misassumptions!
>
> > > > > On Darwin, fsync() does not issue CACHEFLUSH to the drive.  Instead,
> > > > > it has an fcntl F_FULLFSYNC which does that, which is documented in
> > > > > Darwin's fsync() page as working with all Darwin's filesystems,
> > > > > provided the hardware honours CACHEFLUSH or the equivalent.
> > > >
> > > > That seems somewhat strange to me, I'd much rather be able to say that
> > > > fsync() itself is safe. An added fcntl hack doesn't really help the
> > > > applications that already rely on the correct behaviour.
> > >
> > > According to the Darwin fsync(2) man page, it claims Darwin is the
> > > only OS which has a facility to commit the data to disk platters.
> > > (And it claims to do this with IDE, SCSI and FibreChannel.  With
> > > journalling filesystems, it requests the journal to do the commit but
> > > the cache flush still ultimately reaches the disk.  Sounds like a good
> > > implementation to me).
> >
> > The implementation may be nice, but it's the idea that is appalling to
> > me. But it sounds like the Darwin man page is out of date, or at least
> > untrue.
> >
> > > SQLite (a nice open source database) will use F_FULLFSYNC on Darwin to
> > > do this, and it appears to add a large performance penalty relative to
> > > using fsync() alone.  People noticed and wondered why.
> >
> > Disk cache flushes are nasty, they stall everything. But it's still
> > typically faster than disabling write back caching, so...
>
> I agree that it's nasty.  But then, the fsync() interface is rather
> sub-optimal.  E.g. something like sendmail which writes a new file
> needs to fsync() on the file _and_ its parent directory.  You don't
> want two disk flushes then, just one after both fsync() calls have
> completed.  Similarly if you're doing anything where you want to
> commit data to more than one file.  An fsync_multi() interface would
> be more efficient.
>
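The file-plus-parent-directory sequence described above might look like the
following in C.  This is a minimal sketch: write_file_durably() is a
hypothetical helper name, the 4096-byte path buffer is an arbitrary choice,
and whether either fsync() actually reaches the platters depends on the
filesystem and drive, which is the whole point of this thread.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: create dir/name with the given contents, then
 * fsync() both the file and its parent directory.
 * Returns 0 on success, -1 on error. */
static int write_file_durably(const char *dir, const char *name,
                              const void *buf, size_t len)
{
    char path[4096];
    int n = snprintf(path, sizeof path, "%s/%s", dir, name);
    if (n < 0 || (size_t)n >= sizeof path)
        return -1;                      /* path did not fit */

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {                   /* handle short writes */
        ssize_t w = write(fd, p, len);
        if (w < 0) { close(fd); return -1; }
        p += w;
        len -= (size_t)w;
    }

    if (fsync(fd) < 0) { close(fd); return -1; }   /* file data + inode */
    if (close(fd) < 0)
        return -1;

    int dfd = open(dir, O_RDONLY);      /* directories are opened read-only */
    if (dfd < 0)
        return -1;
    int rc = fsync(dfd);                /* the new directory entry */
    close(dfd);
    return rc;
}
```

Each fsync() here can trigger its own disk flush, which is exactly the
duplication a combined interface would avoid.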
> > > Other OSes show similar performance as Darwin with fsync() only.
> > >
> > > So it looks like the man page is probably accurate: other OSes,
> > > particularly including Linux, don't commit the data reliably to disk
> > > platters when using fsync().
> >
> > How did you reach that conclusion?
>
> From seeing the reported timings for SQLite on Linux and Darwin
> with/without F_FULLFSYNC.  The Linux timings were similar to Darwin
> without F_FULLFSYNC.  Others and myself assumed the timings are
> probably I/O bound, and reflect the transactions going to disk.  But
> it could be Darwin being slower :-)
>
> > reiser certainly does it if you have barriers enabled (which you
> > need anyways to be safe with write back caching), and with a little
> > investigation we can perhaps conclude that XFS is safe as well.
>
> Yes, reiser and XFS look quite convincing.  Although I notice the
> blkdev_issue_flush is conditional in both, and the condition is
> non-trivial.  I'll assume the authors thought specifically about this.
>
> > > In which case, I'd imagine that's why Darwin has a separate option,
> > > because if Darwin's fsync() was many times slower than all the other
> > > OSes, most people would take that as a sign of a badly performing OS,
> > > rather than understanding the benefits.
> >
> > That sounds like marketing driven engineering, nice. It requires app
> > changes, which is pretty silly. I would much rather have a way of just
> > enabling/disabling full flush on a per-device basis, you could use the
> > cache type as the default indicator of whether to issue the cache flush
> > or not. Then let the admin override it, if he wants to run unsafe but
> > faster.
>
> I agree, that makes sense to me too.
>
> > > > > from what little documentation I've found, on Linux it appears to be
> > > > > much less predictable.  It seems that some filesystems, with some
> > > > > kernel versions, and some mount options, on some types of disk, with
> > > > > some drive settings, will commit data to a platter before fsync()
> > > > > returns, and others won't.  And an application calling fsync() has no
> > > > > easy way to find out.  Have I got this wrong?
> > > >
> > > > Nope, I'm afraid that is pretty much true... reiser and (it looks like,
> > > > just grepped) XFS has best support for this. Unfortunately I don't think
> > > > the user can actually tell if the OS does the right thing, outside of
> > > > running a blktrace and verifying that it actually sends a flush cache
> > > > down the queue.
> > >
> > > Ew.  So what do databases on Linux do?  Or are database commits
> > > unreliable because of this?
> >
> > See above.
>
> I conclude that database commits _are_ unreliable on Linux on a
> disturbingly large number of smaller setups.
>
> With ext3 on 2.6 and IDE write cache enabled, fsync() does not even
> guarantee the ordering of writes, let alone commit them properly.
> This is because it omits a journal commit (and hence IDE barrier), if
> the data writes haven't changed the inode, which they don't if it's
> within the 1-second mtime granularity.
>
> O_SYNC on ext3 suffers the same problems.  (I don't know if O_SYNC
> commits data to platters on reiser and XFS, or maintains write
> ordering; I guess that fsync() should be called when those are
> needed).
>
> Considering the marketing of ext3 as offering data integrity, I'm
> disappointed.
>
> An ugly workaround suggests itself, which is to forcibly modify the
> inode after writing and before calling fsync(): write, utime, utime,
> fsync.  As a side effect of the journal barrier, it will cause a cache
> flush to disk.
>
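The write, utime, utime, fsync workaround above could be sketched like this.
It is only a sketch under assumptions: write_utime_fsync() is a hypothetical
helper name, and the choice of timestamps for the two utime() calls (first a
deliberately stale value, then the current time) is a guess at the intent,
which the message does not spell out.  Whether this really forces a journal
commit and cache flush on ext3 is the claim made in this thread, not
something the code can verify.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>
#include <utime.h>

/* Hypothetical helper: overwrite the start of a file in place, dirty the
 * inode with two utime() calls, then fsync().
 * Returns 0 on success, -1 on error. */
static int write_utime_fsync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);   /* no O_TRUNC: in place */
    if (fd < 0)
        return -1;

    const char *p = buf;
    while (len > 0) {                   /* handle short writes */
        ssize_t w = write(fd, p, len);
        if (w < 0) { close(fd); return -1; }
        p += w;
        len -= (size_t)w;
    }

    /* Assumption: set a clearly different timestamp first, so the inode
     * changes even within the 1-second mtime granularity, then set the
     * times back to "now". */
    struct utimbuf stale = { .actime = 0, .modtime = 0 };
    if (utime(path, &stale) < 0 || utime(path, NULL) < 0) {
        close(fd);
        return -1;
    }

    int rc = fsync(fd);                 /* inode is dirty, so ext3 should commit */
    if (close(fd) < 0)
        return -1;
    return rc;
}
```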
> > > > > ps. (An aside question): do you happen to know of a good patch which
> > > > > implements IDE barriers w/ ext3 on 2.4 kernels?  I found a patch by
> > > > > googling, but it seemed that the ext3 parts might not be finished, so
> > > > > I don't trust it.  I've found turning off the IDE write cache makes
> > > > > writes safe, but with a huge performance cost.
> > > >
> > > > The hard part (the IDE code) can be grabbed from the SLES8 latest
> > > > kernels, I developed and tested the code there. That also has the ext3
> > > > bits, IIRC.
> > >
> > > Thanks muchly!  I will definitely take a look at that.  I'm working on
> > > a uClinux project which must use a 2.4 kernel, and performance with
> > > write cache off has been a real problem.  And I've seen fs corruption
> > > after power cycles with write cache on many times, as expected.
> >
> > No problem.
>
> Have looked, it's most helpful, and I will use your patches.
> Ironically, that 2.4 patch seems to include reliable commits w/ ext3,
> because every fsync() commits a journal entry.  Er, I think.  (It was
> optimised away in 2.6: http://lkml.org/lkml/2004/3/18/36).
>
> > > It's a shame the ext3 bits don't do fsync() to the platter though. :-/
> >
> > It really is, apparently none of the ext3 guys care about write back
> > caching problems. The only guy wanting to help with the ext3 bits was
> > Andrew. In the reiserfs guys' favor, they have actively been pursuing
> > solutions to this problem. And XFS recently caught up and should be just
> > as good on the barrier side, I have yet to verify the fsync() part.
>
> There's a call to blkdev_issue_flush in XFS fsync(), so it looks
> promising.  I'm not sure what the condition for calling it depends on
> though, but it seems likely the authors have thought it through.
>
> > > To reliably commit data to an ext3 file, should we do ioctl(block_dev,
> > > HDIO_SET_WCACHE, 1) on 2.6 kernels on IDE?  (The side effects look to
> >
> > Did you mean (..., 0)? And yes, right now it looks like fsync() on ext3
> > isn't any better than on other OSes, so disabling write back caching
> > is the safest.
>
> I meant (..., 1).  For some reason I thought the call to
> update_ordered() in ide-disk.c issued a barrier, a convenient side
> effect of HDIO_SET_WCACHE.  But on re-reading, it doesn't issue a
> barrier.  So that's not a solution.
>
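For completeness, the (..., 0) variant, i.e. turning the drive's write cache
off from an application, might look like this.  A minimal sketch, assuming
an IDE disk handled by a driver that implements HDIO_SET_WCACHE and a caller
with sufficient privileges; set_ide_write_cache() is a hypothetical helper
name and the device path is whatever the caller supplies.

```c
#include <fcntl.h>
#include <linux/hdreg.h>   /* HDIO_SET_WCACHE */
#include <sys/ioctl.h>
#include <unistd.h>

/* Hypothetical helper: enable (1) or disable (0) the drive write cache on
 * the given block device.  Returns 0 on success, -1 on error. */
static int set_ide_write_cache(const char *blockdev, int enable)
{
    int fd = open(blockdev, O_RDONLY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    /* The argument is passed by value, not through a pointer. */
    int rc = ioctl(fd, HDIO_SET_WCACHE, (unsigned long)(enable ? 1 : 0));
    close(fd);
    return rc < 0 ? -1 : 0;
}
```

A call such as set_ide_write_cache("/dev/hda", 0) would then trade write
performance for safety, as described above.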
> (..., 0) sucks performance wise.  I think calling utime to dirty the
> inode prior to fsync() will work with ext3, but it's ugly for many
> reasons, not least that it will work on IDE, but it won't work on
> anything (e.g. SCSI) which uses ordered tags rather than flushes.
>
> > > me like they may create a barrier then flush the cache, even when it's
> > > already enabled, but only on 2.6 kernels).  Or is there a better way?
> > > (I don't see any way to do it on vanilla 2.4 kernels).
> >
> > 2.4 vanilla doesn't have barrier support, unfortunately.
>
> I was wondering how to force an IDE cache flush on 2.4, from the
> application after it's called fsync().  No barrier support implied.  I
> guess there is some way to do it using the IDE taskfile ioctls?
> Nothing is clear here, unfortunately.
>
> I'm surprised blkdev_issue_flush (or the equivalent in 2.4) isn't
> available to userspace through a block device ioctl.  There is
> BLKFLSBUF which _almost_ pretends to do it, but that doesn't issue a
> low-level disk flush, and it invalidates the read-cached data.
>
> > > Should we change to only reiserfs and expect fsync() to commit data
> > > reliably only with that fs?  I realise this is a lot of difficult
> > > questions, that apply to more than just Qemu...
> >
> > Yes, reiser is the only one that works reliably across power loss with
> > write back caching for the journal commits as well as fsync guarantees.
>
> I'll try it.  I see enough problems with ext3 on a tiny embedded
> system (writes stalling for a long time, read-cached data being
> re-read from disk every 5 seconds) that I was avoiding reiser because
> I thought it would be more complicated.  That, and I have high faith
> in e2fsck.  But given the problems with ext3, maybe I'll get better
> embedded results with reiser :)
>
> -- Jamie
>
>
> _______________________________________________
> Qemu-devel mailing list
> Qemu-devel@nongnu.org
> http://lists.nongnu.org/mailman/listinfo/qemu-devel
>
