Re: [TOPIC] Extending the filesystem crash recovery guaranties contract

linux-fsdevel.vger.kernel.org archive mirror

From: "Theodore Ts'o" <tytso@mit.edu>
To: Vijay Chidambaram <vijay@cs.utexas.edu>
Cc: Dave Chinner <david@fromorbit.com>,
	Amir Goldstein <amir73il@gmail.com>,
	lsf-pc@lists.linux-foundation.org,
	"Darrick J. Wong" <darrick.wong@oracle.com>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	Jayashree Mohan <jaya@cs.utexas.edu>,
	Filipe Manana <fdmanana@suse.com>, Chris Mason <clm@fb.com>,
	lwn@lwn.net
Subject: Re: [TOPIC] Extending the filesystem crash recovery guaranties contract
Date: Thu, 9 May 2019 11:46:35 -0400
Message-ID: <20190509154635.GF29703@mit.edu>
In-Reply-To: <CAHWVdUVViC_EJm3K7MfvfSQ+G1u=SX=RXAZWPYjZuS16JWxNEw@mail.gmail.com>

On Thu, May 09, 2019 at 12:02:17AM -0500, Vijay Chidambaram wrote:
> As we have stated multiple times on this and other threads, the
> intention is *not* to come up with one set of crash-recovery
> guarantees that every Linux file system must abide by forever. Ted,
> you keep repeating this, though we have never said this was our
> intention.
> 
> The intention behind this effort is to simply document the
> crash-recovery guarantees provided today by different Linux file
> systems. Ted, you question why this is required at all, and why we
> simply can't use POSIX and man pages.

But who is this documentation targeted towards?  Who is it intended to
benefit?  Most application authors do not write applications with
specific file systems in mind.  And even if they do, they can't
control how their users are going to use it.

> FWIW, I think the position of "if we don't write it down, application
> developers can't depend on it" is wrong. Even with nothing written
> down, developers noticed they could skip fsync() in ext3 when
> atomically updating files with rename(). This led to the whole ext4
> rename-and-delayed-allocation problem. The much better path, IMO, is
> to document the current set of guarantees given by different file
> systems, and talk about what is intended and what is not. This would
> give application developers much better guidance in writing
> applications.

If we were to provide that nuance, that would be much better, I would
agree.  It's not what the current crash consistency guarantees draft
provides, alas.  I'd also want to talk about what is guaranteed
*first*; documenting the current state of affairs, some of which may
be subject to change and the result of the implementation, is far less
important.  So I'd prefer that "documentation of current behavior" be
the last thing in the document --- perhaps in an appendix --- and not
the headliner.

Indeed, I'd use the ext3 O_PONIES discussion as a prime example of the
risk if we were to just "document current practice" and stop there.
It's the fact that your crash consistency guarantees draft claims to
"document current practice" and, at the same time, uses the word
"guarantee" that causes red flags to go up for me.
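
For reference, the rename() update pattern at the center of that
discussion can be sketched as below.  This is a minimal illustration
of the conservative variant that does not lean on any file system's
incidental behavior, assuming a Linux-like system where a directory
can be opened and fsync()'d.  The helper name atomic_replace is
hypothetical, not something taken from the draft or from any
library.

```python
import os
import tempfile


def atomic_replace(path: str, data: bytes) -> None:
    """Replace `path` with `data` so a crash leaves either old or new content.

    Hypothetical helper for illustration only.
    """
    dirname = os.path.dirname(os.path.abspath(path))
    # The temporary file must live in the same directory (same file system)
    # as the target, otherwise rename() cannot be atomic.
    fd, tmp = tempfile.mkstemp(dir=dirname, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            # This is the fsync() that applications got away with skipping
            # on ext3: it makes the new data durable before the rename.
            os.fsync(f.fileno())
        # Atomically switch the name over to the new file.
        os.replace(tmp, path)
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    # Sync the containing directory so the rename itself is durable.
    dfd = os.open(dirname, os.O_RDONLY | os.O_DIRECTORY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

Dropping the first fsync() is the shortcut described in the quoted
text; whether the new data then survives a crash depends on the file
system's implementation rather than on anything POSIX promises.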

If we could separate those two, that would be very helpful.  And if
the current POSIX guarantees are too vague, my preference would be to
first determine what application authors would find more useful in
terms of stricter guarantees, and provide those guarantees as we find
them.  We can always add more guarantees later.  Taking guarantees
away is much harder.  And guarantees by definition always restrict
freedom of action, so this is an engineering tradeoff.  Let's provide
those guarantees when it actually improves application performance,
and not Just Because.

It might also be that defining new system calls, like fbarrier() and
fdatabarrier(), is a better approach than retconning new semantics on
top of fsync().  I just think a principled design approach is better
than taking existing semantics and slapping the word "guarantee" in
the title of said documentation.

I will also say that I have no problems with documenting strong
metadata ordering if it has nothing to do with fsync().  That makes
sense.  The moment that you try to also bring data integrity into the
mix, and give examples of what happens if you call fsync(), it goes
beyond strong metadata ordering.  So if you want to document what
happens without fsync(), ext4 can probably get on board with that.
Unfortunately, in addition to including the word "guarantee", the
current crash consistency draft also includes the word "fsync".

> 4. Apart from developers, a document like this would also help
> academic researchers understand the current state-of-the-art in
> crash-recovery guarantees and the different choices made by different
> file systems. It is non-trivial to understand this without
> documentation.

It's also very hard to understand this without taking performance
constraints and implementation choices into account.  It's trivially
easy to give super-strong crash-recovery guarantees.  But if it
sacrifices performance, is it really "state-of-the-art"?

Worse, different applications may want different guarantees, and may
want different crash consistency vs. performance tradeoffs.  This is
why in general, the concept of providing new interfaces where the
application can state more explicitly what they want is much more
appealing to me.

When I have discussions with Amir, he doesn't just want strong
guarantees; he wants specific guarantees with zero overhead, and our
discussions have been about how we manage the tension between those
two goals.  And it's much easier to achieve this in terms of very
specific cases, such as what happens when you link an O_TMPFILE file
into a directory.

Cheers,

   		     	      	 	   	- Ted
