From: Theodore Tso <tytso@mit.edu>
To: Jamie Lokier <jamie@shareable.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
Valerie Aurora Henson <vaurora@redhat.com>,
linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
Chris Mason <chris.mason@oracle.com>,
Eric Sandeen <sandeen@redhat.com>,
Ric Wheeler <rwheeler@redhat.com>, Nick Piggin <npiggin@suse.de>
Subject: Re: fsync_range_with_flags() - improving sync_file_range()
Date: Thu, 23 Apr 2009 17:13:05 -0400
Message-ID: <20090423211305.GN2723@mit.edu>
In-Reply-To: <20090423204411.GF13326@shareable.org>
On Thu, Apr 23, 2009 at 09:44:11PM +0100, Jamie Lokier wrote:
> Yes that's the page I've read and didn't find useful :-)
> The data-locating metadata is explained thus:
>
> None of these operations write out the file’s metadata. Therefore,
> unless the application is strictly performing overwrites of already-
> instantiated disk blocks, there are no guarantees that the data will be
> available after a crash.
Well, I thought that was clear. Today, sync_file_range(2) only works
if the data-locating metadata is already on the disk. This is useful
for databases where the tablespace is allocated ahead of time, but not
much else.
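
To make the preallocated case concrete, here is a minimal sketch (an illustration, not code from this thread). It assumes a Linux system with glibc; the helper names prealloc_tablespace() and overwrite_and_sync() and the 4096-byte chunk size are hypothetical. The blocks are instantiated by writing zeros and calling fsync(2) once, so that later overwrites touch only already-mapped blocks.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Write exactly len bytes at offset off, retrying short writes.
 * Returns 0 on success, -1 on error. */
static int pwrite_all(int fd, const char *buf, size_t len, off_t off)
{
    while (len > 0) {
        ssize_t n = pwrite(fd, buf, len, off);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
        off += n;
    }
    return 0;
}

/* Hypothetical helper: instantiate `size` bytes of disk blocks by writing
 * zeros, then fsync() once so the block-mapping metadata reaches the disk. */
static int prealloc_tablespace(int fd, off_t size)
{
    char zeros[4096];

    memset(zeros, 0, sizeof(zeros));
    for (off_t off = 0; off < size; off += (off_t)sizeof(zeros)) {
        size_t chunk = (size - off < (off_t)sizeof(zeros))
                           ? (size_t)(size - off) : sizeof(zeros);
        if (pwrite_all(fd, zeros, chunk, off) < 0)
            return -1;
    }
    return fsync(fd);
}

/* Hypothetical helper: overwrite an already-instantiated range and write
 * out just those pages. This does not commit metadata and does not issue
 * a disk cache flush; it only covers the page-cache side. */
static int overwrite_and_sync(int fd, const char *buf, size_t len, off_t off)
{
    assert(len > 0);
    if (pwrite_all(fd, buf, len, off) < 0)
        return -1;
    return sync_file_range(fd, off, (off_t)len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}
```

If the write extended the file or landed in a hole instead, the same call would return success without any guarantee that the data could be found after a crash.
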
> But a kernel thread from Feb 2008 revealed the truth:
> sync_file_range() _doesn't_ commit data on such filesystems.
Because we could very easily add a flag which would cause it to commit
the data-locating metadata blocks --- or maybe we change it so that it
does commit the data-locating metadata, on the assumption that if the
data-locating metadata is already committed, which would be true for
all of its existing users, it's a no-op, and if it isn't, we should
just commit the data-locating metadata and add a call from the existing
implementation to a filesystem-provided method function.
> So sync_file_range() is basically useless as a data integrity
> operation. It's not a substitute for fdatasync(). Therefore why
> would you ever use it?
It's not useful *today*. But we could make it useful. The power of
the existing bit flags is useful, although granted it can be confusing
for users who haven't meditated deeply upon the writeback
code paths. I thought it was clear, but if it isn't we can improve
the documentation.
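
As a rough illustration of what the separate bit flags allow (a sketch, not code from this thread), the following uses SYNC_FILE_RANGE_WRITE alone to start write-out of a chunk without blocking, and the three-flag combination to wait for the previous chunk. The flag names are those documented in sync_file_range(2); the function names and the streaming pattern are hypothetical. Because this example extends the file, it only paces writeback and says nothing about durability.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Start write-out of dirty pages in [off, off+len) without waiting. */
static int range_start_writeback(int fd, off_t off, off_t len)
{
    return sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);
}

/* Wait for write-out already in flight, submit anything still dirty,
 * then wait for that to complete as well. */
static int range_write_and_wait(int fd, off_t off, off_t len)
{
    return sync_file_range(fd, off, len,
                           SYNC_FILE_RANGE_WAIT_BEFORE |
                           SYNC_FILE_RANGE_WRITE |
                           SYNC_FILE_RANGE_WAIT_AFTER);
}

/* Hypothetical streaming writer: writes nchunks copies of buf, starts
 * writeback for each chunk as soon as it is written, and waits on the
 * previous chunk so that roughly two chunks are dirty at any time. */
static int stream_chunks(int fd, const char *buf, size_t chunk, int nchunks)
{
    assert(chunk > 0 && nchunks > 0);
    for (int i = 0; i < nchunks; i++) {
        off_t off = (off_t)i * (off_t)chunk;

        if (pwrite(fd, buf, chunk, off) != (ssize_t)chunk)
            return -1;
        if (range_start_writeback(fd, off, (off_t)chunk) < 0)
            return -1;
        if (i > 0 &&
            range_write_and_wait(fd, off - (off_t)chunk, (off_t)chunk) < 0)
            return -1;
    }
    /* Wait for the final chunk. */
    return range_write_and_wait(fd, (off_t)(nchunks - 1) * (off_t)chunk,
                                (off_t)chunk);
}
```
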
More to the point, given that we already have sync_file_range(2), I
would argue that it would be unfortunate to create a new system call
that has overlapping functionality but which is not a superset of
sync_file_range(2). Maybe Nick has a good reason for starting with an
entirely new system call, but if so, it would be nice if it at least
had the power of sync_file_range(2), in addition to having new
functionality.
> > But the interface does make a lot of sense. (But maybe that's because
> > I've spent too much time staring at all of the page writeback call
> > paths, and compared to that even string theory is pretty simple. :-)
>
> Yeah, sounds like you have studied both and gained the proper perspective :-)
>
> I suspect all the fsync-related uncertainty about whether it really
> works, including interactions with filesystem quirks, reliable and
> potential bugs in filesystems, would be much easier to get right if we
> only had a way to repeatably test it.
The answer today is that sync_file_range(2) is purely a creature of the
MM subsystem, and doesn't do anything with respect to filesystem
metadata or barriers. Once you understand that, the rest of the man
page is pretty simple, I think. :-)
Whether or not it should *continue* to be that way in the future is a
different discussion, of course.
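
One practical consequence, sketched here as an illustration rather than as code from this thread: when a write makes the file grow, the new size and block mappings are metadata, so an application that needs the data back after a crash would call fdatasync(2), which per its man page also flushes the metadata needed to retrieve the data. The helper names open_log() and append_durable() are hypothetical. Whether the drive's write cache is flushed too is a separate matter that this sketch does not address.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open (or create) a log file for read/write. */
static int open_log(const char *path)
{
    return open(path, O_RDWR | O_CREAT, 0600);
}

/* Hypothetical helper: append one record at end of file and make it
 * retrievable after a crash. Returns the offset the record was written
 * at, or -1 on error. */
static off_t append_durable(int fd, const void *rec, size_t len)
{
    assert(len > 0);

    off_t end = lseek(fd, 0, SEEK_END);
    if (end < 0)
        return -1;
    if (pwrite(fd, rec, len, end) != (ssize_t)len)
        return -1;
    /* The file grew, so data-locating metadata changed. A call to
     * sync_file_range() would not write that metadata; fdatasync() does. */
    if (fdatasync(fd) < 0)
        return -1;
    return end;
}
```
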
> I'm thinking running a kernel inside a VM invoked and
> stopped/killed/branched is the only realistic way to test that all
> data is committed properly, with/without necessary I/O barriers, and
> recovers properly after a crash and resume. Fortunately we have good
> VMs now, such a test seems very doable. It would help with testing
> journalling & recovery behaviour too.
>
> Is there such a test or related tool already?
I don't know of one. I agree it would be a useful thing to have. It
won't test barriers at the driver level, but it would be good for
testing everything above that.
- Ted