linux-fsdevel.vger.kernel.org archive mirror
From: "Theodore Ts'o" <tytso@mit.edu>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Vijay Chidambaram <vijay@cs.utexas.edu>,
	lsf-pc@lists.linux-foundation.org,
	Dave Chinner <david@fromorbit.com>,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jayashree Mohan <jaya@cs.utexas.edu>,
	Filipe Manana <fdmanana@suse.com>, Chris Mason <clm@fb.com>,
	lwn@lwn.net
Subject: Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
Date: Thu, 2 May 2019 22:30:43 -0400
Message-ID: <20190503023043.GB23724@mit.edu>
In-Reply-To: <CAOQ4uxjNWLvh7EmizA7PjmViG5nPMsvB2UbHW6-hhbZiLadQTA@mail.gmail.com>

On Thu, May 02, 2019 at 01:39:47PM -0400, Amir Goldstein wrote:
> > The expectation is that applications will use this, and then rename
> > the O_TMPFILE file over the original file. Is this correct? If so, is
> > there also an implied barrier between O_TMPFILE metadata and the
> > rename?

In the case of O_TMPFILE, the file can be brought into the namespace
using something like:

linkat(AT_FDCWD, "/proc/self/fd/42", AT_FDCWD, pathname, AT_SYMLINK_FOLLOW);

it's not using rename.
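
A minimal sketch of that whole flow, for illustration only.  The helper
name publish_tmpfile() and its parameters are hypothetical; the sketch
assumes /proc is mounted and that the file system holding `dir`
supports O_TMPFILE.  Nothing here forces writeback before the link.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes O_TMPFILE in <fcntl.h> */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` bytes into an unnamed file created
 * in `dir`, then give it the name `dir`/`name` with linkat(2).
 * Returns 0 on success, -1 with errno set on failure.
 */
int publish_tmpfile(const char *dir, const char *name,
                    const void *data, size_t len)
{
    char proc_path[64], dest[PATH_MAX];
    const char *p = (const char *)data;
    int fd, rc, saved;

    assert(dir != NULL && name != NULL);

    /* The file has no name yet, so it cannot be opened by path. */
    fd = open(dir, O_TMPFILE | O_WRONLY, 0644);
    if (fd < 0)
        return -1;

    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
    snprintf(dest, sizeof(dest), "%s/%s", dir, name);

    /* Same call as in the message: link the open inode into place. */
    rc = linkat(AT_FDCWD, proc_path, AT_FDCWD, dest, AT_SYMLINK_FOLLOW);
    saved = errno;
    close(fd);
    errno = saved;
    return rc;
}
```

linkat(2) fails with EEXIST if the destination name already exists, so
this flow creates a new name rather than replacing an existing one.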

To be clear, this discussion happened in the hallway, and it's not
clear it had full support from everyone.  After our discussion, some of
us came up with an example where forcing a call to
filemap_write_and_wait() before the linkat(2) might *not* be the right
thing.  Suppose some browser wanted to wait until a file was fully
downloaded before letting it appear in the directory --- but what was
being downloaded was a 4 GiB DVD image (say, a distribution's install
media).  If the download was done using O_TMPFILE followed by
linkat(2), that might be a case where forcing the data blocks to disk
before allowing the linkat(2) to proceed might not be what the
application or the user would want.

So it might be that we will need to add a linkat flag to indicate that
we want the kernel to call filemap_write_and_wait() before making the
metadata changes in linkat(2).
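
Without such a flag, an application that does want the data written
first can approximate the behaviour from userspace.  The sketch below
rests on assumptions: linkat_after_writeback() is a hypothetical name,
/proc must be mounted, and fdatasync(2) is only a stand-in.  It is
stronger than an in-kernel filemap_write_and_wait(), because it also
waits for the metadata needed to read the data back.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Hypothetical helper, not an existing API: flush the file's data,
 * then link the open file descriptor to `newpath` (resolved relative
 * to `newdirfd`).  Returns 0 on success, -1 with errno set.
 */
int linkat_after_writeback(int fd, int newdirfd, const char *newpath)
{
    char proc_path[64];

    assert(fd >= 0 && newpath != NULL);

    /* Userspace stand-in for the proposed in-kernel writeback step. */
    if (fdatasync(fd) < 0)
        return -1;

    snprintf(proc_path, sizeof(proc_path), "/proc/self/fd/%d", fd);
    return linkat(AT_FDCWD, proc_path, newdirfd, newpath,
                  AT_SYMLINK_FOLLOW);
}
```

For the 4 GiB download case described above, the caller would simply
skip the helper and call linkat(2) directly.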

> For replacing an existing file with another the same could be
> achieved with renameat2(AT_FDCWD, tempname, AT_FDCWD, newname,
> RENAME_ATOMIC). There is no need to create the tempname
> file using O_TMPFILE in that case, but if you do, the RENAME_ATOMIC
> flag would be redundant.
> 
> RENAME_ATOMIC flag is needed because directories and non regular
> files cannot be created using O_TMPFILE.

I think there's much less consensus about this.  Again, most of this
happened in a hallway conversation.

> > Where does this land us on the discussion about documenting
> > file-system crash-recovery guarantees? Has that been deemed not
> > necessary?
> 
> Can't say for sure.
> Some filesystem maintainers hold on to the opinion that they do
> NOT wish to have a document describing existing behavior of specific
> filesystems, which is large parts of the document that your group posted.
> 
> They would rather that only the guaranties of the APIs are documented
> and those should already be documented in man pages anyway - if they
> are not, man pages could be improved.
> 
> I am not saying there is no room for a document that elaborates on those
> guaranties. I personally think that could be useful and certainly think that
> your group's work for adding xfstest coverage for API guaranties is useful.

Again, here is my concern.  If we promise that ext4 will always obey
Dave Chinner's SOMC model, it would forever rule out Daejun Park and
Dongkun Shin's "iJournaling: Fine-grained journaling for improving the
latency of fsync system call"[1] published in Usenix ATC 2017.

[1] https://www.usenix.org/system/files/conference/atc17/atc17-park.pdf

That's because this provides a fast fsync() using an incremental
journal.  This fast fsync would cause the metadata associated with the
inode being fsync'ed to be persisted after the crash --- ahead of
metadata changes to other, potentially completely unrelated files,
which would *not* be persisted after the crash.  Fine-grained
journalling would provide all of the guarantees required by POSIX, and
applications that only care about the single file being fsync'ed
would be happy.  BUT, it violates the proposed crash
consistency guarantees.

So if the crash consistency guarantees forbid future innovations
where applications might *want* a fast fsync() that doesn't drag
unrelated inodes into the persistence guarantees, is that really what
we want?  Do we want to forever rule out various academic
investigations such as Park and Shin's because "it violates the crash
consistency recovery model"?  Especially if some applications don't
*need* the crash consistency model?

						- Ted

P.S.  I feel especially strongly about this because I'm working with an
engineer currently trying to implement a simplified version of Park
and Shin's proposal...  So this is not a hypothetical concern of mine.
I'd much rather not invalidate all of this engineer's work to date,
especially since there is a published paper demonstrating that for
some workloads (such as sqlite), this approach can be a big win.

P.P.S.  One of the other discussions that did happen during the main
LSF/MM File system session, and for which there was general agreement
across a number of major file system maintainers, was a fsync2()
system call which would take a list of file descriptors (and flags)
that should be fsync'ed.  The semantics would be that when the
fsync2() successfully returns, all of the guarantees of fsync() or
fdatasync() requested by the list of file descriptors and flags would
be satisfied.  This would allow file systems to more optimally fsync a
batch of files, for example by implementing data integrity writebacks
for all of the files, followed by a single journal commit to guarantee
persistence for all of the metadata changes.
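
No fsync2() exists today, so the following is only a userspace sketch
of the semantics described above.  Every name in it (struct
fsync2_entry, FSYNC2_DATASYNC, fsync2_emulated(), fsync2_demo()) is
hypothetical.  It loops over the list and so gets none of the batching
benefit; it only shows what a successful return would promise.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical flag: request fdatasync() rather than fsync() semantics. */
#define FSYNC2_DATASYNC 0x1u

/* Hypothetical list element: one file descriptor plus its flags. */
struct fsync2_entry {
    int fd;
    unsigned int flags;
};

/*
 * Emulation: returns 0 only if every requested fsync()/fdatasync()
 * succeeded; returns -1 with errno set at the first failure.
 */
int fsync2_emulated(const struct fsync2_entry *ents, size_t n)
{
    size_t i;

    assert(ents != NULL || n == 0);
    for (i = 0; i < n; i++) {
        int rc = (ents[i].flags & FSYNC2_DATASYNC)
                     ? fdatasync(ents[i].fd)
                     : fsync(ents[i].fd);
        if (rc < 0)
            return -1;
    }
    return 0;
}

/* Usage: write one byte to each of two files, then sync both. */
int fsync2_demo(const char *path_a, const char *path_b)
{
    struct fsync2_entry ents[2];
    int rc = -1;

    ents[0].fd = open(path_a, O_WRONLY | O_CREAT, 0644);
    ents[0].flags = 0;
    ents[1].fd = open(path_b, O_WRONLY | O_CREAT, 0644);
    ents[1].flags = FSYNC2_DATASYNC;

    if (ents[0].fd >= 0 && ents[1].fd >= 0 &&
        write(ents[0].fd, "a", 1) == 1 &&
        write(ents[1].fd, "b", 1) == 1)
        rc = fsync2_emulated(ents, 2);

    if (ents[0].fd >= 0)
        close(ents[0].fd);
    if (ents[1].fd >= 0)
        close(ents[1].fd);
    return rc;
}
```

A kernel implementation could instead write back the data for every
listed file and then issue a single journal commit, which is the
optimization the paragraph above describes.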

