Re: [TOPIC] Extending the filesystem crash recovery guaranties contract

From: "Theodore Ts'o" <tytso@mit.edu>
To: Dave Chinner <david@fromorbit.com>
Cc: Amir Goldstein <amir73il@gmail.com>,
	Vijay Chidambaram <vijay@cs.utexas.edu>,
	lsf-pc@lists.linux-foundation.org,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jayashree Mohan <jaya@cs.utexas.edu>,
	Filipe Manana <fdmanana@suse.com>, Chris Mason <clm@fb.com>,
	lwn@lwn.net
Subject: Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
Date: Wed, 8 May 2019 22:20:13 -0400
Message-ID: <20190509022013.GC7031@mit.edu>
In-Reply-To: <20190509014327.GT1454@dread.disaster.area>

On Thu, May 09, 2019 at 11:43:27AM +1000, Dave Chinner wrote:
> 
> .... the whole point of SOMC is that it allows filesystems to avoid
> dragging external metadata into fsync() operations /unless/ there's
> a user visible ordering dependency that must be maintained between
> objects.  If all you are doing is stabilising file data in a stable
> file/directory, then independent, incremental journaling of the
> fsync operations on that file fit the SOMC model just fine.

Well, that's not what Vijay's crash consistency guarantees state.  It
guarantees quite a bit more than what you've written above.  Which is
my concern.

> > P.P.S.  One of the other discussions that did happen during the main
> > LSF/MM File system session, and for which there was general agreement
> > across a number of major file system maintainers, was a fsync2()
> > system call which would take a list of file descriptors (and flags)
> > that should be fsync'ed.
> 
> Hmmmm, that wasn't on the agenda, and nobody has documented it as
> yet.

It came up as a suggested alternative during Ric Wheeler's "Async all
the things" session.  The problem he was trying to address was
programs (perhaps userspace file servers) that need to fsync a large
number of files at the same time.  The problem with his suggested
solution (which we have for AIO and io_uring already) of having the
program issue a large number of asynchronous fsyncs and then wait
for them all is that the back-end interface is a work queue, so there
is a lot of effective serialization that takes place.

> > The semantics would be that when the
> > fsync2() successfully returns, all of the guarantees of fsync() or
> > fdatasync() requested by the list of file descriptors and flags would
> > be satisfied.  This would allow file systems to more optimally fsync a
> > batch of files, for example by implementing data integrity writebacks
> > for all of the files, followed by a single journal commit to guarantee
> > persistence for all of the metadata changes.
> 
> What happens when you get writeback errors on only some of the fds?
> How do you report the failures and what do you do with the journal
> commit on partial success?

Well, one approach would be to pass back the errors in the structure.
Say something like this:

     struct fsync_req {
          int	fd;        /* IN:  file descriptor to sync */
          int	flags;     /* IN:  fsync() or fdatasync() semantics */
          int	retval;    /* OUT: per-fd error status */
     };

     int fsync2(int len, struct fsync_req reqs[]);

As far as what you do with the journal commit on partial success,
there are no atomic, "all or nothing" guarantees with this interface.
It is implementation specific whether there would be one or more file
system commits necessary before fsync2 returned.
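
To make the calling convention concrete, here is a minimal userspace
stand-in that just loops over the requests.  It is only a sketch: the
name fsync2_emulated(), the FSYNC2_DATASYNC flag value, the negative
errno convention for retval, and the "number of failed requests"
return value are all assumptions for illustration, none of which were
specified in the discussion.  It also gets none of the batching
benefit a real in-kernel implementation would be after.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical flag value: request fdatasync() instead of fsync(). */
#define FSYNC2_DATASYNC 0x1

struct fsync_req {
	int fd;      /* IN */
	int flags;   /* IN */
	int retval;  /* OUT: 0 on success, negative errno on failure */
};

/*
 * Userspace stand-in for the proposed fsync2(): sync each fd in turn
 * and record a per-fd result.  Returns the number of failed requests
 * (an assumed convention).  There is no "all or nothing" behaviour:
 * later requests are still attempted after an earlier one fails.
 */
int fsync2_emulated(int len, struct fsync_req reqs[])
{
	int failed = 0;

	for (int i = 0; i < len; i++) {
		int ret = (reqs[i].flags & FSYNC2_DATASYNC)
			? fdatasync(reqs[i].fd)
			: fsync(reqs[i].fd);

		reqs[i].retval = ret < 0 ? -errno : 0;
		if (ret < 0)
			failed++;
	}
	return failed;
}
```

A caller would fill in fd and flags for each entry, make one call, and
then check retval on each entry to see which files failed to sync.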

> Of course, this ignores the elephant in the room: applications can
> /already do this/ using AIO_FSYNC and have individual error status
> for each fd. Not to mention that filesystems already batch
> concurrent fsync journal commits into a single operation. I'm not
> seeing the point of a new syscall to do this right now....

But it doesn't work very well, because the implementation uses a
workqueue.  Sure, you could create N worker threads for N fd's that
you want to fsync, and then the file system can batch the fsync
requests.  But wouldn't it be so much simpler to give a list of fd's
that should be fsync'ed to the file system?  That way you don't have
to do lots of work to split up the work so they can be submitted in
parallel, only to have the file system batch up all of the requests
being issued from all of those kernel threads.
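
For comparison, the "N threads for N fd's" approach looks roughly like
the following when done from userspace with pthreads.  This is a
sketch under assumptions: struct sync_job and fsync_in_parallel() are
hypothetical names, and it uses userspace threads rather than the
kernel workqueue discussed above.  The point is only how much
scaffolding is needed to hand the file system a set of concurrent
fsync() calls that it may then batch.

```c
#include <errno.h>
#include <pthread.h>
#include <unistd.h>

/* Hypothetical per-fd job record for this illustration. */
struct sync_job {
	pthread_t thread;
	int fd;       /* IN */
	int retval;   /* OUT: 0 on success, negative errno on failure */
	int started;  /* nonzero if the worker thread was created */
};

static void *sync_worker(void *arg)
{
	struct sync_job *job = arg;

	/* errno is per-thread, so each worker sees its own error. */
	job->retval = fsync(job->fd) < 0 ? -errno : 0;
	return NULL;
}

/*
 * Start one thread per fd so the fsync() calls are in flight
 * concurrently, then wait for all of them.  Returns the number of
 * jobs that failed.
 */
int fsync_in_parallel(struct sync_job jobs[], int len)
{
	int failed = 0;

	for (int i = 0; i < len; i++) {
		int err = pthread_create(&jobs[i].thread, NULL,
					 sync_worker, &jobs[i]);

		jobs[i].started = (err == 0);
		if (err)
			jobs[i].retval = -err;
	}
	for (int i = 0; i < len; i++) {
		if (jobs[i].started)
			pthread_join(jobs[i].thread, NULL);
		if (jobs[i].retval < 0)
			failed++;
	}
	return failed;
}
```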

So yes, it's identical to the interfaces we already have.  Just like
select(2), poll(2) and epoll(2) are functionally identical...

     	  	     	    	   	 - Ted

Thread overview: 25+ messages
2019-04-27 21:00 [TOPIC] Extending the filesystem crash recovery guaranties contract Amir Goldstein
2019-05-02 16:12 ` Amir Goldstein
2019-05-02 17:11   ` Vijay Chidambaram
2019-05-02 17:39     ` Amir Goldstein
2019-05-03  2:30       ` Theodore Ts'o
2019-05-03  3:15         ` Vijay Chidambaram
2019-05-03  9:45           ` Theodore Ts'o
2019-05-04  0:17             ` Vijay Chidambaram
2019-05-04  1:43               ` Theodore Ts'o
2019-05-07 18:38                 ` Jan Kara
2019-05-03  4:16         ` Amir Goldstein
2019-05-03  9:58           ` Theodore Ts'o
2019-05-03 14:18             ` Amir Goldstein
2019-05-09  2:36             ` Dave Chinner
2019-05-09  1:43         ` Dave Chinner
2019-05-09  2:20           ` Theodore Ts'o [this message]
2019-05-09  2:58             ` Dave Chinner
2019-05-09  3:31               ` Theodore Ts'o
2019-05-09  5:19                 ` Darrick J. Wong
2019-05-09  5:02             ` Vijay Chidambaram
2019-05-09  5:37               ` Darrick J. Wong
2019-05-09 15:46               ` Theodore Ts'o
2019-05-09  8:47           ` Amir Goldstein
2019-05-02 21:05   ` Darrick J. Wong
2019-05-02 22:19     ` Amir Goldstein
