Re: [TOPIC] Extending the filesystem crash recovery guaranties contract

From: Dave Chinner <david@fromorbit.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: Amir Goldstein <amir73il@gmail.com>,
	Vijay Chidambaram <vijay@cs.utexas.edu>,
	lsf-pc@lists.linux-foundation.org,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jayashree Mohan <jaya@cs.utexas.edu>,
	Filipe Manana <fdmanana@suse.com>, Chris Mason <clm@fb.com>,
	lwn@lwn.net
Subject: Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
Date: Thu, 9 May 2019 12:58:45 +1000
Message-ID: <20190509025845.GV1454@dread.disaster.area>
In-Reply-To: <20190509022013.GC7031@mit.edu>

On Wed, May 08, 2019 at 10:20:13PM -0400, Theodore Ts'o wrote:
> On Thu, May 09, 2019 at 11:43:27AM +1000, Dave Chinner wrote:
> > 
> > .... the whole point of SOMC is that it allows filesystems to avoid
> > dragging external metadata into fsync() operations /unless/ there's
> > a user visible ordering dependency that must be maintained between
> > objects.  If all you are doing is stabilising file data in a stable
> > file/directory, then independent, incremental journaling of the
> > fsync operations on that file fits the SOMC model just fine.
> 
> Well, that's not what Vijay's crash consistency guarantees state.  It
> guarantees quite a bit more than what you've written above.  Which is
> my concern.

SOMC does not define crash consistency rules - it defines change
dependencies and how ordering and atomicity impact the dependency
graph. How other people have interpreted that is out of my control.

> It came up as suggested alternative during Ric Wheeler's "Async all
> the things" session.  The problem he was trying to address was
> programs (perhaps userspace file servers) that need to fsync a large
> number of files at the same time.  The problem with his suggested
> solution (which we have for AIO and io_uring already) of having the
> program issue a large number of asynchronous fsync's and then waiting
> for them all, is that the back-end interface is a work queue, so there
> is a lot of effective serialization that takes place.

We got linear scaling out to device bandwidth and/or IOPS limits
with bulk fsync benchmarks on XFS with that simple workqueue
implementation.

If there are problems, then I'd suggest that people should be
reporting bugs to the developers of the AIO_FSYNC code (i.e.
Christoph and myself) or providing patches to improve it so these
problems go away.

A new syscall with essentially the same user interface doesn't
guarantee that these implementation problems will be solved.
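
For readers who have not used the existing interface, here is a minimal
sketch of a bulk fsync through native Linux AIO. It assumes a kernel whose
AIO implementation accepts IOCB_CMD_FSYNC and it calls the raw
io_setup/io_submit/io_getevents syscalls directly, so no extra library is
needed. The helper name aio_bulk_fsync() and its result convention are
made up for illustration; this is not the benchmark code referred to
above.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/aio_abi.h>	/* struct iocb, struct io_event, IOCB_CMD_FSYNC */
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Hypothetical helper: queue one IOCB_CMD_FSYNC per fd, then reap them all.
 * results[i] receives 0 or a negative errno for fds[i].
 * Returns 0 once every request is accounted for, or -1 with errno set if
 * the AIO context could not be set up or completions could not be reaped.
 */
int aio_bulk_fsync(const int *fds, int *results, unsigned n)
{
	aio_context_t ctx = 0;	/* must be zero before io_setup() */
	struct iocb *cbs = NULL, **ptrs = NULL;
	struct io_event *evs = NULL;
	unsigned next = 0, inflight = 0;
	int ret = -1, saved;

	if (n == 0)
		return 0;
	assert(fds != NULL && results != NULL);

	if (syscall(SYS_io_setup, n, &ctx) < 0)
		return -1;

	cbs = calloc(n, sizeof(*cbs));
	ptrs = calloc(n, sizeof(*ptrs));
	evs = calloc(n, sizeof(*evs));
	if (!cbs || !ptrs || !evs) {
		errno = ENOMEM;
		goto out;
	}

	for (unsigned i = 0; i < n; i++) {
		cbs[i].aio_fildes = fds[i];
		cbs[i].aio_lio_opcode = IOCB_CMD_FSYNC;
		cbs[i].aio_data = i;	/* lets us match completions to fds */
		ptrs[i] = &cbs[i];
	}

	/* io_submit() stops at the first iocb it rejects: record it, move on. */
	while (next < n) {
		long r = syscall(SYS_io_submit, ctx, (long)(n - next), ptrs + next);

		if (r > 0) {
			next += r;
			inflight += r;
		} else {
			results[next++] = r < 0 ? -errno : -EAGAIN;
		}
	}

	/* Completions arrive in any order; each carries its own status. */
	while (inflight > 0) {
		long r = syscall(SYS_io_getevents, ctx, 1L, (long)inflight, evs, NULL);

		if (r < 0) {
			if (errno == EINTR)
				continue;
			goto out;
		}
		for (long i = 0; i < r; i++)
			results[evs[i].data] = (int)evs[i].res;
		inflight -= r;
	}
	ret = 0;
out:
	saved = errno;
	syscall(SYS_io_destroy, ctx);	/* waits for anything still in flight */
	free(cbs);
	free(ptrs);
	free(evs);
	errno = saved;
	return ret;
}
```

A caller passes an array of open file descriptors and gets one status per
descriptor back, which is the per-fd error reporting discussed further
down in this mail.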


> > > The semantics would be that when the
> > > fsync2() successfully returns, all of the guarantees of fsync() or
> > > fdatasync() requested by the list of file descriptors and flags would
> > > be satisfied.  This would allow file systems to more optimally fsync a
> > > batch of files, for example by implementing data integrity writebacks
> > > for all of the files, followed by a single journal commit to guarantee
> > > persistence for all of the metadata changes.
> > 
> > What happens when you get writeback errors on only some of the fds?
> > How do you report the failures and what do you do with the journal
> > commit on partial success?
> 
> Well, one approach would be to pass back the errors in the structure.
> Say something like this:
> 
>      struct fsync_req {
>           int	fd;        /* IN */
>           int	flags;     /* IN */
>           int	retval;    /* OUT */
>      };
>
>      int fsync2(int len, struct fsync_req reqs[]);

So it's essentially identical to the AIO_FSYNC interface, except
that it is synchronous.

> As far as what do you do with the journal commit on partial success,
> there are no atomic, "all or nothing" guarantees with this interface.
> It is implementation specific whether there would be one or more file
> system commits necessary before fsync2 returned.

IOWs, same guarantees as AIO_FSYNC.
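
To make those semantics concrete, here is a rough userspace emulation of
the proposed call. It only illustrates the contract as described above:
every entry gets its own status and nothing is atomic across entries. The
name fsync2_emul(), the FSYNC2_DATASYNC flag value, and the return
convention (number of failed entries) are all assumptions, since the
proposal defines none of them. A plain loop like this also gains none of
the batching a kernel implementation might attempt.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Assumed flag: the proposal does not define any flag values. */
#define FSYNC2_DATASYNC	0x1

struct fsync_req {
	int	fd;		/* IN */
	int	flags;		/* IN */
	int	retval;		/* OUT: 0 or negative errno */
};

/*
 * Hypothetical userspace stand-in for the proposed fsync2().
 * Each request is synced independently, so a failure on one fd does not
 * prevent the others from being attempted or from succeeding.
 * Returns the number of requests that failed (assumed convention).
 */
int fsync2_emul(int len, struct fsync_req reqs[])
{
	int failed = 0;

	assert(len == 0 || reqs != NULL);

	for (int i = 0; i < len; i++) {
		int r = (reqs[i].flags & FSYNC2_DATASYNC)
			? fdatasync(reqs[i].fd)
			: fsync(reqs[i].fd);

		reqs[i].retval = r < 0 ? -errno : 0;
		if (r < 0)
			failed++;
	}
	return failed;
}
```

On return the caller has to walk the array and check every retval; a
non-zero return value says that something failed, not which entries did.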

> > Of course, this ignores the elephant in the room: applications can
> > /already do this/ using AIO_FSYNC and have individual error status
> > for each fd. Not to mention that filesystems already batch
> > concurrent fsync journal commits into a single operation. I'm not
> > seeing the point of a new syscall to do this right now....
> 
> But it doesn't work very well, because the implementation uses a
> workqueue.

Then fix the fucking implementation!

Sheesh! Did LSFMM include a free lobotomy for participants, or
something?

-Dave.
-- 
Dave Chinner
david@fromorbit.com
