From: Dave Chinner <david@fromorbit.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: Amir Goldstein <amir73il@gmail.com>, Jan Kara <jack@suse.cz>,
"Darrick J . Wong" <darrick.wong@oracle.com>,
Chris Mason <clm@fb.com>, Al Viro <viro@zeniv.linux.org.uk>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
linux-xfs <linux-xfs@vger.kernel.org>,
Ext4 <linux-ext4@vger.kernel.org>,
Linux Btrfs <linux-btrfs@vger.kernel.org>,
Linux API <linux-api@vger.kernel.org>
Subject: Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA
Date: Sat, 1 Jun 2019 08:45:49 +1000 [thread overview]
Message-ID: <20190531224549.GF29573@dread.disaster.area> (raw)
In-Reply-To: <20190531164136.GA3066@mit.edu>
On Fri, May 31, 2019 at 12:41:36PM -0400, Theodore Ts'o wrote:
> On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> > What do you think of:
> >
> > "AT_ATOMIC_DATA (since Linux 5.x)
> > A filesystem which accepts this flag will guarantee that if the linked file
> > name exists after a system crash, then all of the data written to the file
> > and all of the file's metadata at the time of the linkat(2) call will be
> > visible.
>
> ".... will be visible after the the file system is remounted". (Never
> hurts to be explicit.)
>
> > The way to achieve this guarantee on old kernels is to call fsync (2)
> > before linking the file, but doing so will also results in flushing of
> > volatile disk caches.
> >
> > A filesystem which accepts this flag does NOT
> > guarantee that any of the file hardlinks will exist after a system crash,
> > nor that the last observed value of st_nlink (see stat (2)) will persist."
> >
>
> This is I think more precise:
>
> This guarantee can be achieved by calling fsync(2) before linking
> the file, but there may be more performant ways to provide these
> semantics. In particular, note that the use of the AT_ATOMIC_DATA
> flag does *not* guarantee that the new link created by linkat(2)
> will be persisted after a crash.
So here's the *implementation* problem I see with this definition of
AT_ATOMIC_DATA. After linkat(dirfd, name, AT_ATOMIC_DATA), there is
no guarantee that the data is on disk or that the link is present.
However:
linkat(dirfd, name, AT_ATOMIC_DATA);
fsync(dirfd);
Suddenly changes all that.
i.e. when we fsync(dirfd) we guarantee that "name" is present in the
directory and because we used AT_ATOMIC_DATA it implies that the
data pointed to by "name" must be present on disk. IOWs, what was
once a pure directory sync operation now *must* fsync all the child
inodes that have been linkat(AT_ATOMIC_DATA) since the last time the
direct has been made stable.
IOWs, the described AT_ATOMIC_DATA "we don't have to write the data
during linkat() go-fast-get-out-of-gaol-free" behaviour isn't worth
the pixels it is written on - it just moves all the complexity to
directory fsync, and that's /already/ a behavioural minefield.
IMO, the "work-around" of forcing filesystems to write back
destination inodes during a link() operation is just nasty and will
just end up with really weird performance anomalies occurring in
production systems. That's not really a solution, either, especially
as it is far, far faster for applications to use AIO_FSYNC and then
on the completion callback run a normal linkat() operation...
Hence, if I've understood these correctly, then I'll be recommending
that XFS follows this:
> We should also document that a file system which does not implement
> this flag MUST return EINVAL if it is passed this flag to linkat(2).
and returns -EINVAL to these flags because we do not have the change
tracking infrastructure to handle these directory fsync semantics.
I also suspect that, even if we could track this efficiently, we
can't do the flushing atomically because of locking order
constraints between directories, regular files, pages in the page
cache, etc.
Given that we can already use AIO to provide this sort of ordering,
and AIO is vastly faster than synchronous IO, I don't see any point
in adding complex barrier interfaces that can be /easily implemented
in userspace/ using existing AIO primitives. You should start
thinking about expanding libaio with stuff like
"link_after_fdatasync()" and suddenly the whole problem of
filesystem data vs metadata ordering goes away because the
application directly controls all ordering without blocking and
doesn't need to care what the filesystem under it does....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2019-05-31 22:45 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-05-27 17:26 [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA Amir Goldstein
2019-05-28 20:06 ` Darrick J. Wong
2019-05-29 5:58 ` Amir Goldstein
2019-05-28 20:26 ` Theodore Ts'o
2019-05-29 5:38 ` Amir Goldstein
2019-05-31 15:21 ` Amir Goldstein
2019-05-31 16:41 ` Theodore Ts'o
2019-05-31 17:22 ` Amir Goldstein
2019-05-31 19:21 ` Theodore Ts'o
2019-05-31 22:45 ` Dave Chinner [this message]
2019-05-31 23:28 ` Dave Chinner
2019-06-01 8:01 ` Amir Goldstein
2019-06-03 4:25 ` Dave Chinner
2019-06-03 6:17 ` Amir Goldstein
2019-06-01 7:21 ` Amir Goldstein
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190531224549.GF29573@dread.disaster.area \
--to=david@fromorbit.com \
--cc=amir73il@gmail.com \
--cc=clm@fb.com \
--cc=darrick.wong@oracle.com \
--cc=jack@suse.cz \
--cc=linux-api@vger.kernel.org \
--cc=linux-btrfs@vger.kernel.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=tytso@mit.edu \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox