From: Dave Chinner <david@fromorbit.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: Ryan Lortie <desrt@desrt.ca>, linux-ext4@vger.kernel.org
Subject: Re: ext4 file replace guarantees
Date: Sat, 22 Jun 2013 13:29:44 +1000
Message-ID: <20130622032944.GX29376@dastard>
In-Reply-To: <20130621203547.GA10582@thunk.org>
On Fri, Jun 21, 2013 at 04:35:47PM -0400, Theodore Ts'o wrote:
> So I've been taking a closer look at the rename code, and there's
> something I can do which will improve the chances of avoiding data
> loss on a crash after an application tries to replace file contents
> via:
>
> 1) write foo.new
> 2) <omit fsync of foo.new>
> 3) rename foo.new to foo
>
> Those are the kernel patches that I cc'ed you on.
>
> The reason why it's still not a guarantee is because we are not doing
> a file integrity writeback; this is not as important for small files,
> but if foo.new is several megabytes, not all of the data blocks will
> be flushed out before the rename, and this will kill performance, and
> in some cases it might not be necessary.
>
> Still, for small files ("most config files are smaller than 100k"),
> this should serve you just fine. Of course, it's not going to be in
> currently deployed kernels, so I don't know how much these proposed
> patches will help you. I'm doing this mainly because it helps protect
> users against (in my mind) unwise application programmers, and it
> doesn't cost us any extra performance from what we are currently
> doing, so why not improve things a little?
>
>
> If you want better guarantees than that, this is the best you can do:
>
> 1) write foo.new using file descriptor fd
> 2) sync_file_range(fd, 0, 0, SYNC_FILE_RANGE_WRITE);
> 3) rename foo.new to foo
>
> This will work on today's kernels, and it should be safe to do for all
> file systems.

No, it's not. SYNC_FILE_RANGE_WRITE does not wait for IO completion,
and not all filesystems synchronise journal flushes with data write
IO completion.

Indeed, we have a current "data corruption after power failure"
problem found on Ceph storage clusters using XFS for the OSD storage
that is specifically triggered by the use of SYNC_FILE_RANGE_WRITE
rather than using fsync() to get data to disk.

http://oss.sgi.com/pipermail/xfs/2013-June/027390.html

The question of whether sync_file_range() is safe on ext4 was also
raised, and my response was:

http://oss.sgi.com/pipermail/xfs/2013-June/027452.html

"> Is sync_file_range(2) similarly problematic with ext4?

In data=writeback mode, most definitely. For data=ordered, I have no
idea - the writeback paths in ext4 are ... convoluted, and I hurt my
brain every time I look at them. I wouldn't be surprised if there
are problems, but they'll be different problems because ext4 doesn't
do speculative prealloc..."

.....

> > aside: what's your opinion on fdatasync()? Seems like it wouldn't be
> > good enough for my usecase because I'm changing the size of the file....
>
> fdatasync() is basically sync_file_range() plus a CACHE FLUSH command.
> Like sync_file_range, it doesn't sync the metadata (and by the way,
> this includes things like indirect blocks for ext2/3 or extent tree
> blocks for ext4).

If fdatasync() on ext4 doesn't sync metadata blocks required to
access the data that was just written by the fdatasync() call, then
it is broken.

fdatasync() is supposed to guarantee all the data in the file and
all the metadata *needed to access that data* is on stable storage
by the time the fdatasync() completes. i.e. fdatasync() might just
be a data write and cache flush, but in the case where allocation,
file size changes, etc have occurred, it is effectively the
equivalent of a full fsync().

So, fdatasync() will do what you want, but the performance overhead
will be no different to fsync() in the rename case because all the
metadata pointing to the tmp file needs to be committed as well...
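
For illustration only, here is a minimal sketch of the replace-via-rename
sequence with the sync left in. It is written in Python because the os
module maps directly onto the syscalls discussed here. The function name
replace_file() and the ".new" suffix are made up for the example, and
the final directory fsync is an extra assumption that is not discussed
in this mail.

```python
import os

def replace_file(path, data):
    """Replace the contents of `path` with `data` (bytes): write, fsync, rename."""
    tmp = path + ".new"  # hypothetical temp name, mirroring "foo.new" above
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        # os.write() may write fewer bytes than requested, so loop until done.
        view = memoryview(data)
        while view:
            written = os.write(fd, view)
            view = view[written:]
        # Wait until the data, and the metadata needed to read it back,
        # is on stable storage before the rename can expose the new file.
        # os.fdatasync(fd) could be used here instead; for a freshly
        # written file the cost is expected to be about the same.
        os.fsync(fd)
    finally:
        os.close(fd)

    os.rename(tmp, path)

    # Assumption (not from this mail): also fsync the parent directory so
    # that the rename itself is on stable storage.
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

Calling replace_file("foo", b"new contents") then corresponds to steps
1 to 3 of the quoted sequence, with the omitted fsync put back.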
----

But, let me make a very important point here. Nobody should be
trying to optimise a general purpose application for a specific
filesystem's data integrity behaviour. fsync() and fdatasync() are
the gold standard because they are consistently implemented across
all Linux filesystems.

The reason I say this is that we've been down this road before and
we should have learnt better from it. Ted, you should recognise this
because you were front and centre in the fallout of it:

http://tytso.livejournal.com/61989.html

".... Application writers had gotten lazy, because ext3 by default
has a commit interval of 5 seconds, and uses a journalling mode
called data=ordered. What does this mean? ....

... Since ext3 became the dominant filesystem for Linux, application
writers and users have started depending on this, and so they become
shocked and angry when their system locks up and they lose data -
even though POSIX never really made any such guarantee. ..."

This discussion of "how can we abuse ext4 data=ordered semantics to
avoid using fsync()" is heading right down this path again.
It is starting from "fsync on ext4 is too slow", and solutions are
being proposed that assume either that everyone is using ext4
(patently untrue) or that all filesystems behave like ext4 (also
patently untrue).

To all the application developers reading this: just use
fsync()/fdatasync() for operations that require data integrity. Your
responsibility is to your users: using methods that don't guarantee
data integrity, and therefore will result in data loss, indicates
that you place no value on your users' data whatsoever. There is
no place for being fancy when it comes to data integrity - it needs
to be reliable and rock-solid.

If that means your application is slow, then you need to explain why
it is slow to your users and how they can change a knob to make it
fast by trading off data integrity. The user can make the choice at
that point, and they have no grounds to complain if they lose data
because they made a conscious choice to configure their system that
way.

IOWs, the choice of whether data can be lost on a crash is one that
only the user can make. As such, applications need to be
safe-by-default when it comes to data integrity.
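
As a rough sketch of what such a knob could look like (every name here
is hypothetical, including the MYAPP_DURABLE_WRITES environment
variable): the sync is on unless the user explicitly turns it off.

```python
import os

def save(path, data, durable=True):
    """Write `data` (bytes) to `path`; `durable` is the user-facing knob."""
    tmp = path + ".new"  # hypothetical temp name
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()                 # push Python's buffer down to the kernel
        if durable:
            os.fsync(f.fileno())  # safe default: wait for stable storage
    os.rename(tmp, path)

def durable_from_env(environ=os.environ):
    # Hypothetical knob: only an explicit "0" from the user disables the sync.
    return environ.get("MYAPP_DURABLE_WRITES", "1") != "0"
```

An application would then call save(path, data,
durable=durable_from_env()), so a user who never touches the knob gets
the safe behaviour.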
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com