From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Amir Goldstein <amir73il@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org,
Dave Chinner <david@fromorbit.com>, Theodore Tso <tytso@mit.edu>,
Jan Kara <jack@suse.cz>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Jayashree Mohan <jaya@cs.utexas.edu>,
Vijaychidambaram Velayudhan Pillai <vijay@cs.utexas.edu>,
Filipe Manana <fdmanana@suse.com>, Chris Mason <clm@fb.com>,
lwn@lwn.net
Subject: Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
Date: Thu, 2 May 2019 14:05:24 -0700
Message-ID: <20190502210524.GI5200@magnolia>
In-Reply-To: <CAOQ4uxgEicLTA4LtV2fpvx7okEEa=FtbYE7Qa_=JeVEGXz40kw@mail.gmail.com>

On Thu, May 02, 2019 at 12:12:22PM -0400, Amir Goldstein wrote:
> On Sat, Apr 27, 2019 at 5:00 PM Amir Goldstein <amir73il@gmail.com> wrote:
> >
> > Suggestion for another filesystems track topic.
> >
> > Some of you may remember the emotional(?) discussions that ensued
> > when the crashmonkey developers embarked on a mission to document
> > and verify filesystem crash recovery guaranties:
> >
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxj8YpYPPdEvAvKPKXO7wdBg6T1O3osd6fSPFKH9j=i2Yg@mail.gmail.com/
> >
> > There are two camps among filesystem developers and every camp
> > has good arguments for wanting to document existing behavior and for
> > not wanting to document anything beyond "use fsync if you want any guaranty".
> >
> > I would like to take a suggestion proposed by Jan on a related discussion:
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxjQx+TO3Dt7TA3ocXnNxbr3+oVyJLYUSpv4QCt_Texdvw@mail.gmail.com/
> >
> > and make a proposal that may be able to meet the concerns of
> > both camps.
> >
> > The proposal is to add new APIs which communicate
> > crash consistency requirements of the application to the filesystem.
> >
> > Example API could look like this:
> > renameat2(..., RENAME_METADATA_BARRIER | RENAME_DATA_BARRIER)
> > It's just an example. The API could take another form and may need
> > more barrier types (I proposed to use new file_sync_range() flags).
> >
> > The idea is simple though.
> > METADATA_BARRIER means all the inode metadata will be observed
> > after crash if rename is observed after crash.
> > DATA_BARRIER same for file data.
> > We may also want an "ALL_METADATA_BARRIER" and/or
> > "METADATA_DEPENDENCY_BARRIER" to more accurately
> > describe what SOMC guaranties actually provide today.
> >
> > The implementation is also simple. Filesystems that currently
> > have SOMC behavior don't need to do anything to respect
> > METADATA_BARRIER and only need to call
> > filemap_write_and_wait_range() to respect DATA_BARRIER.
> > Filesystem developers are thus not tying their hands w.r.t. future
> > performance optimizations for operations that are not explicitly
> > requesting a barrier.
> >
>
> An update: Following the LSF session on $SUBJECT I had a discussion
> with Ted, Jan and Chris.
>
> We were all in agreement that linking an O_TMPFILE into the namespace
> is probably already perceived by users as the barrier/atomic operation that
> I am trying to describe.
>
> So at least maintainers of btrfs/ext4/ext2 are sympathetic to the idea of
> providing the required semantics when linking O_TMPFILE *as long* as
> the semantics are properly documented.
>
> This is what open(2) man page has to say right now:
>
> * Creating a file that is initially invisible, which is then
>   populated with data and adjusted to have appropriate filesystem
>   attributes (fchown(2), fchmod(2), fsetxattr(2), etc.) before being
>   atomically linked into the filesystem in a fully formed state
>   (using linkat(2) as described above).
>
> The phrase that I would like to add (probably in link(2) man page) is:
> "The filesystem provides the guaranty that after a crash, if the linked
> O_TMPFILE is observed in the target directory, then all the data and

"if the linked O_TMPFILE is observed" ... meaning that if we can't
recover all the data+metadata information then it's ok to obliterate the
file? Is the filesystem allowed to drop the tmpfile data if userspace
links the tmpfile into a directory but doesn't fsync the directory?

TBH I would've thought the basis of the RENAME_ATOMIC (and LINK_ATOMIC?)
user requirement would be "Until I say otherwise I want always to be
able to read <data> from this given string <pathname>."

(vs. regular Unix rename/link where we make you specify how much you
care about that by hitting us on the head with a file fsync and then a
directory fsync.)

> metadata modifications made to the file before being linked are also
> observed."
>
> For some filesystems, btrfs in particular, that would mean an implicit
> fsync on the linked inode. On other filesystems, ext4/xfs in particular,
> that would only require at least committing delayed allocations, but
> will NOT require inode fsync nor journal commit/flushing disk caches.

I don't think it does much good to commit delalloc blocks but not flush
dirty overwrites, and I don't think it makes a lot of sense to flush out
overwrite data without also pushing out the inode metadata.

FWIW I'm ok with the "Here's an 'I'm really serious' flag that carries
with it a full fsync" approach, though how do we sell developers on
using it?

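
As a rough illustration of what "carries with it a full fsync" would
mean for the O_TMPFILE case, here is a userspace sketch that emulates it
with existing calls. No LINK_ATOMIC flag exists today, and the helper
name link_tmpfile_serious() is hypothetical. The /proc/self/fd linkat()
idiom is the one documented in open(2), and it assumes /proc is mounted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* for O_TMPFILE */
#endif
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: create an unnamed file in dir, fill it, and link
 * it in as dir/name with a full fsync of the inode and the directory.
 * Returns 0 on success or a negative errno value. */
static int link_tmpfile_serious(const char *dir, const char *name,
                                const void *buf, size_t len)
{
    char proc_path[64];
    int ret = 0;

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -errno;

    /* Unnamed file; fails (e.g. with EOPNOTSUPP) where the filesystem
     * does not support O_TMPFILE. */
    int fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
    if (fd < 0) {
        ret = -errno;
        goto out_dir;
    }

    if (write(fd, buf, len) != (ssize_t)len) {
        ret = -EIO;             /* short write treated as an error */
        goto out;
    }
    /* Full fsync: data and inode metadata reach stable storage first. */
    if (fsync(fd) != 0) {
        ret = -errno;
        goto out;
    }

    /* Give the file its name. */
    snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
    if (linkat(AT_FDCWD, proc_path, dfd, name, AT_SYMLINK_FOLLOW) != 0) {
        ret = -errno;
        goto out;
    }
    /* Persist the new directory entry as well. */
    if (fsync(dfd) != 0)
        ret = -errno;
out:
    close(fd);
out_dir:
    close(dfd);
    return ret;
}
```
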
> I would like to hear the opinion of XFS developers and filesystem
> maintainers who did not attend the LSF session.

I miss you all too. Sorry I couldn't make it this year. :(

> I have no objection to adding an opt-in LINK_ATOMIC flag
> and pass it down to filesystems instead of changing behavior and
> patching stable kernels, but I prefer the latter.
>
> I believe this should have been the semantics to begin with
> if for no other reason, because users would expect it regardless
> of whatever we write in manual page and no matter how many
> !!!!!!!! we use for disclaimers.
>
> And if we can all agree on that, then O_TMPFILE is quite young
> in historic perspective, so not too late to call the expectation gap
> a bug and fix it.(?)

Why would linking an O_TMPFILE be a special case as opposed to making
hard links in general? If you hardlink a dirty file then surely you'd
also want to be able to read the data from the new location?

> Taking this another step forward, if we agree on the language
> I used above to describe the expected behavior, then we can
> add an opt-in RENAME_ATOMIC flag to provide the same
> semantics and document it in the same manner (this functionality
> is needed for directories and non-regular files) and all there is left
> is the fun part of choosing the flag name ;-)

Will have to think about /that/ some more.

--D

>
> Thanks,
> Amir.