public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed
From: Dave Chinner <david@fromorbit.com>
To: Amir Goldstein <amir73il@gmail.com>
Cc: linux-fsdevel <linux-fsdevel@vger.kernel.org>,
	lsf-pc <lsf-pc@lists.linux-foundation.org>,
	Jan Kara <jack@suse.cz>, Christian Brauner <brauner@kernel.org>,
	Josef Bacik <josef@toxicpanda.com>,
	Jeff Layton <jlayton@kernel.org>
Subject: Re: [LSF/MM/BPF TOPIC] vfs write barriers
Date: Tue, 28 Jan 2025 10:34:06 +1100	[thread overview]
Message-ID: <Z5gX7hTPhTtxam1g@dread.disaster.area> (raw)
In-Reply-To: <CAOQ4uxgYERCmPrTXjuM4Q3HdWK_HxuOkkpAEnesDHCAD=9fsOg@mail.gmail.com>

On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > This proposed write barrier does not seem capable of providing any
> > sort of physical data or metadata/data write ordering guarantees, so
> > I'm a bit lost in how it can be used to provide reliable "crash
> > consistent change tracking" when there is no relationship between
> > the data/metadata in memory and data/metadata on disk...
> 
> That's a good question. A bit hard to explain but I will try.
> 
> The short answer is that the vfs write barrier does *not* by itself
> provide the guarantee for "crash consistent change tracking".
> 
> In the prototype, the "crash consistent change tracking" guarantee
> is provided by the fact that the change records are recorded as
> as metadata in the same filesystem, prior to the modification and
> those metadata records are strictly ordered by the filesystem before
> the actual change.

Uh, ok.

I've read the docco and I think I understand what the prototype
you've pointed me at is doing.

It is using a separate chunk of the filesystem as a database to
persist change records for data in the filesystem. It is doing this
by creating an empty(?) file per change record in a per time
period (T) directory instance.

i.e.

write()
 -> pre-modify
  -> fanotify
   -> userspace HSM
    -> create file in dir T named "<filehandle-other-stuff>"

And then you're relying on the filesystem to make that directory
entry T/<filehandle-other-stuff> stable before the data the
pre-modify record was generated for ever gets written.

IOWs, you've specifically relying on *all unrelated metadata changes
in the filesystem* having strict global ordering *and* being
persisted before any data written after the metadata was created
is persisted.

Sure, this might work right now on XFS because the journalling
implementation -currently- provides global metadata ordering and
data/metadata ordering based on IO completion to submission
ordering.

However, we do not guarantee that XFS will -always- have this
behaviour. This is an *implementation detail*, not a guaranteed
behaviour we will preserve for all time. i.e. we reserve the right
to change how we do unrelated metadata and data/metadata ordering
internally.

This reminds of how applications observed that ext3 ordered mode
didn't require fsync to guarantee the data got written before the
metadata, so they elided the fsync() because it was really expensive
on ext3. i.e. they started relying on a specific filesystem
implementation detail for "correct crash consistency behaviour",
without understanding that it -only worked on ext3- and broken crash
consistency behaviour on all other filesystems. That was *bad*, and
it took a long time to get the message across that applications
*must* use fsync() for correct crash consistency behaviour...

What you are describing for your prototype HSM to provide crash
consistent change tracking really seems to me like it is reliant
on the side effects of specific filesystem implementation choices,
not a behaviour that all filesysetms guarantee.

i.e. not all filesystems provide strict global metadata ordering
semantics, and some fs maintainers are on record explicitly stating
that they will not provide or guarantee them. e.g. ext4, especially
with fast commits enabled, will not provide global strictly ordered
metadata semantics. btrfs also doesn't provide such a guarantee,
either.

> I would love to discuss the merits and pitfalls of this method, but the
> main thing I wanted to get feedback on is whether anyone finds the
> described vfs API useful for anything other that the change tracking
> system that I described.

If my understanding is correct, then this HSM prototype change
tracking mechanism seems like a fragile, unsupportable architecture.
I don't think we should be trying to add new VFS infrastructure to
make it work, because I think the underlying behaviours it requires
from filesystems are simply not guaranteed to exist.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

  parent reply	other threads:[~2025-01-27 23:34 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-01-17 18:01 [LSF/MM/BPF TOPIC] vfs write barriers Amir Goldstein
2025-01-19 21:15 ` Dave Chinner
2025-01-20 11:41   ` Amir Goldstein
2025-01-23  0:34     ` Dave Chinner
2025-01-23 14:01       ` Amir Goldstein
2025-01-23 18:14     ` Jeff Layton
2025-01-24 21:07       ` Amir Goldstein
2025-02-11 14:53       ` Jan Kara
2025-03-20 17:00         ` Amir Goldstein
2025-03-27 18:23           ` Amir Goldstein
2025-01-27 23:34     ` Dave Chinner [this message]
2025-01-29  1:39       ` Amir Goldstein
2025-02-11 21:12         ` Dave Chinner
2025-02-12  8:29           ` Amir Goldstein

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Z5gX7hTPhTtxam1g@dread.disaster.area \
    --to=david@fromorbit.com \
    --cc=amir73il@gmail.com \
    --cc=brauner@kernel.org \
    --cc=jack@suse.cz \
    --cc=jlayton@kernel.org \
    --cc=josef@toxicpanda.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox