public inbox for linux-fsdevel@vger.kernel.org
 help / color / mirror / Atom feed
* [LSF/MM/BPF TOPIC] vfs write barriers
@ 2025-01-17 18:01 Amir Goldstein
  2025-01-19 21:15 ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Amir Goldstein @ 2025-01-17 18:01 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: lsf-pc, Jan Kara, Christian Brauner, Josef Bacik, Jeff Layton

Hi all,

I would like to present the idea of vfs write barriers that was proposed by Jan
and prototyped for the use of fanotify HSM change tracking events [1].

The historical records state that I had mentioned the idea briefly at the end of
my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
its wider implications at the time.

The vfs write barriers are implemented by taking a per-sb srcu read side
lock for the scope of {mnt,file}_{want,drop}_write().

This could be used by users - an HSM service, in the case of the prototype -
to wait for all in-flight write syscalls, without blocking new write syscalls
as the stricter fsfreeze() does.

This ability to wait for in-flight write syscalls is used by the prototype to
implement a crash consistent change tracking method [3] without the
need to use the heavy fsfreeze() hammer.
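To make the intended semantics concrete, here is a toy userspace analogue (a sketch only: the `write_barrier`/`wb_*` names and the two-phase bookkeeping are made up for illustration; the actual prototype uses a per-sb SRCU read-side lock inside the kernel):

```c
#include <pthread.h>

/*
 * Toy userspace analogue of the proposed vfs write barrier.
 * Writers enter/exit a read-side section tagged with the current phase;
 * wb_wait() flips the phase and waits only for writers that entered under
 * the old phase, so new writers are never blocked - unlike fsfreeze().
 */
struct write_barrier {
	pthread_mutex_t lock;
	pthread_cond_t drained;
	int phase;        /* current phase: 0 or 1 */
	int inflight[2];  /* writers in flight per phase */
};

#define WRITE_BARRIER_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, { 0, 0 } }

/* ~ mnt_want_write(): enter a write section; never blocks on the barrier */
static int wb_enter(struct write_barrier *wb)
{
	pthread_mutex_lock(&wb->lock);
	int p = wb->phase;
	wb->inflight[p]++;
	pthread_mutex_unlock(&wb->lock);
	return p;
}

/* ~ mnt_drop_write(): leave the write section entered under phase @p */
static void wb_exit(struct write_barrier *wb, int p)
{
	pthread_mutex_lock(&wb->lock);
	if (--wb->inflight[p] == 0)
		pthread_cond_broadcast(&wb->drained);
	pthread_mutex_unlock(&wb->lock);
}

/* ~ sb_write_barrier(): wait for all writers that entered before the flip */
static void wb_wait(struct write_barrier *wb)
{
	pthread_mutex_lock(&wb->lock);
	int old = wb->phase;
	wb->phase = !old;
	while (wb->inflight[old] != 0)
		pthread_cond_wait(&wb->drained, &wb->lock);
	pthread_mutex_unlock(&wb->lock);
}
```

The point of the sketch is only the contract: wb_wait() returns once all pre-barrier writers have exited, while writers arriving during the wait proceed under the new phase.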

For the prototype, there is no user API to enable write barriers
or to wait for in-flight write syscalls; there is only an internal user
(fanotify), so the user-facing API is only the fanotify API. However,
the vfs infrastructure was written in a way that it could serve other
subsystems or be exposed to user applications via a vfs API.

I wanted to throw these questions to the crowd:
- Can you think of other internal use cases for an SRCU scope around
  vfs write operations [*]? Or around other vfs operations?
- Would it be useful to export this API to userspace so applications
  could make use of it?

[*] "vfs write operations" in this context refer to any operation
     that would block on a frozen fs.

I recall that Jeff mentioned there could be some use case related
to providing crash consistency to NFS change cookies, but I am
not sure if this is still relevant after the multigrain ctime work.

Thanks,
Amir.

[1] https://github.com/amir73il/linux/commits/sb_write_barrier
[2] https://lwn.net/Articles/932415/
[3] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#modified-files-query

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-17 18:01 [LSF/MM/BPF TOPIC] vfs write barriers Amir Goldstein
@ 2025-01-19 21:15 ` Dave Chinner
  2025-01-20 11:41   ` Amir Goldstein
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2025-01-19 21:15 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik,
	Jeff Layton

On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> Hi all,
> 
> I would like to present the idea of vfs write barriers that was proposed by Jan
> and prototyped for the use of fanotify HSM change tracking events [1].
> 
> The historical records state that I had mentioned the idea briefly at the end of
> my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> its wider implications at the time.
> 
> The vfs write barriers are implemented by taking a per-sb srcu read side
> lock for the scope of {mnt,file}_{want,drop}_write().
> 
> This could be used by users - in the case of the prototype - an HSM service -
> to wait for all in-flight write syscalls, without blocking new write syscalls
> as the stricter fsfreeze() does.
> 
> This ability to wait for in-flight write syscalls is used by the prototype to
> implement a crash consistent change tracking method [3] without the
> need to use the heavy fsfreeze() hammer.

How does this provide any guarantee at all? It doesn't order or
wait for physical IOs in any way, so writeback can be active on a
file and writing data from both sides of a syscall write "barrier".
i.e. there is no coherency between what is on disk, the cmtime of
the inode and the write barrier itself.

Freeze is an actual physical write barrier. A very heavy handed
physical write barrier, yes, but it has very well defined and
bounded physical data persistence semantics.

This proposed write barrier does not seem capable of providing any
sort of physical data or metadata/data write ordering guarantees, so
I'm a bit lost in how it can be used to provide reliable "crash
consistent change tracking" when there is no relationship between
the data/metadata in memory and data/metadata on disk...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-19 21:15 ` Dave Chinner
@ 2025-01-20 11:41   ` Amir Goldstein
  2025-01-23  0:34     ` Dave Chinner
                       ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-01-20 11:41 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik,
	Jeff Layton

On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > Hi all,
> >
> > I would like to present the idea of vfs write barriers that was proposed by Jan
> > and prototyped for the use of fanotify HSM change tracking events [1].
> >
> > The historical records state that I had mentioned the idea briefly at the end of
> > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> > its wider implications at the time.
> >
> > The vfs write barriers are implemented by taking a per-sb srcu read side
> > lock for the scope of {mnt,file}_{want,drop}_write().
> >
> > This could be used by users - in the case of the prototype - an HSM service -
> > to wait for all in-flight write syscalls, without blocking new write syscalls
> > as the stricter fsfreeze() does.
> >
> > This ability to wait for in-flight write syscalls is used by the prototype to
> > implement a crash consistent change tracking method [3] without the
> > need to use the heavy fsfreeze() hammer.
>
> How does this provide any guarantee at all? It doesn't order or
> wait for physical IOs in any way, so writeback can be active on a
> file and writing data from both sides of a syscall write "barrier".
> i.e. there is no coherency between what is on disk, the cmtime of
> the inode and the write barrier itself.
>
> Freeze is an actual physical write barrier. A very heavy handed
> physical write barrier, yes, but it has very well defined and
> bounded physical data persistence semantics.

Yes. Freeze is a "write barrier to persistent storage".
This is not what "vfs write barrier" is about.
I will try to explain better.

Some syscalls modify the data/metadata of filesystem objects in memory
(a.k.a "in-core") and some syscalls query in-core data/metadata
of filesystem objects.

It is often the case that in-core data/metadata readers are not fully
synchronized with in-core data/metadata writers, and in-core data and
metadata are often not modified atomically w.r.t. the in-core
data/metadata readers.
Even related metadata attributes are often not modified atomically
w.r.t. their readers (e.g. statx()).

When it comes to "observing changes" multigrain ctime/mtime has
improved things a lot for observing a change in ctime/mtime since
last sampled and for observing an order of ctime/mtime changes
on different inodes, but it hasn't changed the fact that ctime/mtime
changes can be observed *before* the respective metadata/data
changes can be observed.

An example problem is that a naive backup or indexing program can
read old data/metadata with new timestamp T and wrongly conclude
that it read all changes up to time T.

It is true that "real" backup programs know that applications and
the filesystem need to be quiesced before backup, but day to day
cloud storage sync programs and indexers cannot practically freeze
the filesystem for their work.

For the HSM prototype, we track changes to a filesystem during
a given time period by handling pre-modify vfs events and recording
the file handles of changed objects.

sb_write_barrier(sb) provides an (internal so far) vfs API to wait
for in-flight syscalls that may still be modifying user-visible in-core
data/metadata, without blocking new syscalls.

The method described in the HSM prototype [3] uses this API
to persist the state that all the changes until time T were "observed".

> This proposed write barrier does not seem capable of providing any
> sort of physical data or metadata/data write ordering guarantees, so
> I'm a bit lost in how it can be used to provide reliable "crash
> consistent change tracking" when there is no relationship between
> the data/metadata in memory and data/metadata on disk...

That's a good question. A bit hard to explain but I will try.

The short answer is that the vfs write barrier does *not* by itself
provide the guarantee for "crash consistent change tracking".

In the prototype, the "crash consistent change tracking" guarantee
is provided by the fact that the change records are recorded
as metadata in the same filesystem, prior to the modification, and
those metadata records are strictly ordered by the filesystem before
the actual change.

The vfs write barrier allows partitioning the change tracking records
into overlapping time periods in a way that allows the *consumer* of
the changes to consume the changes in a "crash consistent manner",
because:

1. All the in-core changes recorded before the barrier are fully
    observable after the barrier
2. All the in-core changes that started after the barrier will be recorded
    for the future change query

I would love to discuss the merits and pitfalls of this method, but the
main thing I wanted to get feedback on is whether anyone finds the
described vfs API useful for anything other than the change tracking
system that I described.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-20 11:41   ` Amir Goldstein
@ 2025-01-23  0:34     ` Dave Chinner
  2025-01-23 14:01       ` Amir Goldstein
  2025-01-23 18:14     ` Jeff Layton
  2025-01-27 23:34     ` Dave Chinner
  2 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2025-01-23  0:34 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik,
	Jeff Layton

On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> For the HSM prototype, we track changes to a filesystem during
> a given time period by handling pre-modify vfs events and recording
> the file handles of changed objects.
> 
> sb_write_barrier(sb) provides an (internal so far) vfs API to wait
> for in-flight syscalls that may still be modifying user-visible in-core
> data/metadata, without blocking new syscalls.

Yes, I get this part. What I don't understand is how it is in any
way useful....

> The method described in the HSM prototype [3] uses this API
> to persist the state that all the changes until time T were "observed".
> 
> > This proposed write barrier does not seem capable of providing any
> > sort of physical data or metadata/data write ordering guarantees, so
> > I'm a bit lost in how it can be used to provide reliable "crash
> > consistent change tracking" when there is no relationship between
> > the data/metadata in memory and data/metadata on disk...
> 
> That's a good question. A bit hard to explain but I will try.
> 
> The short answer is that the vfs write barrier does *not* by itself
> provide the guarantee for "crash consistent change tracking".
> 
> In the prototype, the "crash consistent change tracking" guarantee
> is provided by the fact that the change records are recorded
> as metadata in the same filesystem, prior to the modification, and
> those metadata records are strictly ordered by the filesystem before
> the actual change.

This doesn't make any sense to me - you seem to be making
assumptions that I know an awful lot about how your HSM prototype
works.

What's in a change record, when does it get written, what are its
persistence semantics, and what filesystem metadata is it being written
to? How does this relate to the actual dirty data that is
resident in the page cache and hasn't been written to stable
storage yet? Is there another change record to say the data the
first change record tracks has been written to persistent storage?

> The vfs write barrier allows partitioning the change tracking records
> into overlapping time periods in a way that allows the *consumer* of
> the changes to consume the changes in a "crash consistent manner",
> because:

> 1. All the in-core changes recorded before the barrier are fully
>     observable after the barrier
> 2. All the in-core changes that started after the barrier will be recorded
>     for the future change query
> 
> I would love to discuss the merits and pitfalls of this method, but the
> main thing I wanted to get feedback on is whether anyone finds the
> described vfs API useful for anything other than the change tracking
> system that I described.

This seems like a very specialised niche use case right now, but I
still have no clear idea how the application using this proposed
write barrier actually works to achieve the stated functionality
this feature provides it with...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-23  0:34     ` Dave Chinner
@ 2025-01-23 14:01       ` Amir Goldstein
  0 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-01-23 14:01 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik,
	Jeff Layton

On Thu, Jan 23, 2025 at 1:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> > For the HSM prototype, we track changes to a filesystem during
> > a given time period by handling pre-modify vfs events and recording
> > the file handles of changed objects.
> >
> > sb_write_barrier(sb) provides an (internal so far) vfs API to wait
> > for in-flight syscalls that may still be modifying user-visible in-core
> > data/metadata, without blocking new syscalls.
>
> Yes, I get this part. What I don't understand is how it is in any
> way useful....
>
> > The method described in the HSM prototype [3] uses this API
> > to persist the state that all the changes until time T were "observed".
> >
> > > This proposed write barrier does not seem capable of providing any
> > > sort of physical data or metadata/data write ordering guarantees, so
> > > I'm a bit lost in how it can be used to provide reliable "crash
> > > consistent change tracking" when there is no relationship between
> > > the data/metadata in memory and data/metadata on disk...
> >
> > That's a good question. A bit hard to explain but I will try.
> >
> > The short answer is that the vfs write barrier does *not* by itself
> > provide the guarantee for "crash consistent change tracking".
> >
> > In the prototype, the "crash consistent change tracking" guarantee
> > is provided by the fact that the change records are recorded
> > as metadata in the same filesystem, prior to the modification, and
> > those metadata records are strictly ordered by the filesystem before
> > the actual change.
>
> This doesn't make any sense to me - you seem to be making
> assumptions that I know an awful lot about how your HSM prototype
> works.
>
> What's in a change record

The prototype creates a directory entry of this name:

changed_dirs/$T/<directory file handle hex>

which gets created if it does not exist before a change in a directory
or before a change to a file's data/metadata [*].

[*] For a non-dir, the change record is for ANY parent of the file;
if the file is unlinked, there is no need to track changes;
if the file is disconnected, it is up to the HSM to decide whether to
block the change or not record it.
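For illustration only, a hypothetical helper (not taken from the prototype's code; the name and signature are made up) that formats such a record name from an opaque file handle buffer:

```c
#include <stdio.h>

/*
 * Hypothetical sketch: build the change-record path
 * "changed_dirs/<period>/<file handle hex>" from an opaque handle buffer.
 * Returns the path length, or -1 if @buf is too small.
 */
static int change_record_path(char *buf, size_t bufsz, unsigned long period,
			      const unsigned char *handle, size_t handle_len)
{
	int n = snprintf(buf, bufsz, "changed_dirs/%lu/", period);

	if (n < 0 || (size_t)n >= bufsz)
		return -1;
	for (size_t i = 0; i < handle_len; i++) {
		/* need room for two hex digits plus the NUL terminator */
		if ((size_t)n + 2 >= bufsz)
			return -1;
		n += snprintf(buf + n, bufsz - n, "%02x", handle[i]);
	}
	return n;
}
```

On Linux, the opaque handle bytes would come from something like name_to_handle_at(2); here the buffer is just treated as raw bytes to keep the sketch self-contained.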

> when does it get written,

The record is written from the handling of fanotify pre-modify events
(not upstream yet), *before* the change to in-core data/metadata.
The events are hooked inside {file,mnt}_want_write() wrappers,
*before* {file,sb}_start_write().

> what are its persistence semantics

The consumer (HSM service) is responsible for persisting
change records (e.g. by fsync of changed_dirs/$T/)

The only guarantee it expects from the filesystem is that
the change records (directory entries) are strictly ordered
to storage before data/metadata changes that are executed
after the change record was written.

> what filesystem metadata is it being written to?

For the prototype it is a directory index,
but that is an implementation detail of this prototype.

> how does this relate to the actual dirty data that is
> resident in the page cache that hasn't been written to stable
> storage yet?

The relation is as follows:
- HSM starts recording change records under both
  changed_dirs/$T/ and changed_dirs/$((T+1))/
- HSM calls sb_write_barrier() and syncfs()
- Then HSM stops recording changes in changed_dirs/$T/

So by the time changed_dirs/$T/ is "sealed", all the dirty data
will either be persisted to stable storage
OR also be recorded in changed_dirs/$((T+1))/
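A minimal model of this sequence (hypothetical code, just to show why no change can fall between the two periods; the `tracker` names are made up and this is not the prototype's implementation):

```c
#include <stdbool.h>
#include <string.h>

/*
 * Toy model of the sealing sequence: while period T is live, changes are
 * recorded in T; during the overlap window they are recorded in both T
 * and T+1; after the write barrier + syncfs, T is sealed and only T+1
 * receives records.
 */
enum { NCHANGES = 8 };

struct tracker {
	bool in_t[NCHANGES];   /* records in changed_dirs/$T/        */
	bool in_t1[NCHANGES];  /* records in changed_dirs/$((T+1))/  */
	bool overlap;          /* recording into both periods        */
	bool sealed;           /* T sealed: only T+1 gets records    */
};

static void tracker_init(struct tracker *tr)
{
	memset(tr, 0, sizeof(*tr));
}

/* pre-modify hook: record the change before the modification happens */
static void record_change(struct tracker *tr, int c)
{
	if (!tr->sealed)
		tr->in_t[c] = true;
	if (tr->overlap || tr->sealed)
		tr->in_t1[c] = true;
}

/* HSM starts recording into both T and T+1 */
static void start_overlap(struct tracker *tr)
{
	tr->overlap = true;
}

/* after sb_write_barrier() + syncfs(): stop recording into T */
static void seal_t(struct tracker *tr)
{
	tr->sealed = true;
}
```

The invariant the consumer relies on: any change that was not yet persisted when T was sealed is guaranteed to appear again in T+1, so nothing is lost across a crash between the two queries.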

> Is there another change record to say the data the
> first change record tracks has been written to persistent storage?
>

Yes, I use a symlink to denote the "current" live change tracking session,
something like:

$ ln -sf $((T)) changed_dirs/current
...
$ ln -sf $((T+1)) changed_dirs/next
... (write barrier etc)
$ sync -f changed_dirs # seal current
$ mv changed_dirs/next changed_dirs/current

As you can see, I was trying to avoid tying the persistence semantics
to the kernel implementation of HSM.

As far as I can tell, the only thing I am missing from the kernel is
the vfs write barrier in order to take care of the rest in userspace.

Yes, there is this baby elephant in the room that "strictly ordered metadata"
is not in any contract, but I am willing to live with that for now, for the
benefit of a filesystem-agnostic HSM implementation.

> > The vfs write barrier allows partitioning the change tracking records
> > into overlapping time periods in a way that allows the *consumer* of
> > the changes to consume the changes in a "crash consistent manner",
> > because:
>
> > 1. All the in-core changes recorded before the barrier are fully
> >     observable after the barrier
> > 2. All the in-core changes that started after the barrier will be recorded
> >     for the future change query
> >
> > I would love to discuss the merits and pitfalls of this method, but the
> > main thing I wanted to get feedback on is whether anyone finds the
> > described vfs API useful for anything other than the change tracking
> > system that I described.
>
> This seems like a very specialised niche use case right now, but I
> still have no clear idea how the application using this proposed
> write barrier actually works to achieve the stated functionality
> this feature provides it with...
>

The problem that the vfs write barrier is trying to solve is the
ordering between changing and observing in-core data/metadata.
It seems like a problem that is more generic than my specialized
niche, but maybe it isn't.

The consumer of change tracking will start observing (reading)
the data/metadata only after sealing the period $T records,
so it avoids the risk of observing old data/metadata in a directory
recorded in period $T, without having another record in period $T+1.

The point in all this story is that the vfs write barrier is needed even if
there is no syncfs() at all and if the application does not care about
persistence at all.

For example, for an application that syncs files to replica storage,
without the write barrier the change query for period T can result in reading
non-updated data/metadata and reaching the incorrect conclusion that
*everything is in sync*.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-20 11:41   ` Amir Goldstein
  2025-01-23  0:34     ` Dave Chinner
@ 2025-01-23 18:14     ` Jeff Layton
  2025-01-24 21:07       ` Amir Goldstein
  2025-02-11 14:53       ` Jan Kara
  2025-01-27 23:34     ` Dave Chinner
  2 siblings, 2 replies; 14+ messages in thread
From: Jeff Layton @ 2025-01-23 18:14 UTC (permalink / raw)
  To: Amir Goldstein, Dave Chinner
  Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik

On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote:
> On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > 
> > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > > Hi all,
> > > 
> > > I would like to present the idea of vfs write barriers that was proposed by Jan
> > > and prototyped for the use of fanotify HSM change tracking events [1].
> > > 
> > > The historical records state that I had mentioned the idea briefly at the end of
> > > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> > > its wider implications at the time.
> > > 
> > > The vfs write barriers are implemented by taking a per-sb srcu read side
> > > lock for the scope of {mnt,file}_{want,drop}_write().
> > > 
> > > This could be used by users - an HSM service, in the case of the prototype -
> > > to wait for all in-flight write syscalls, without blocking new write syscalls
> > > as the stricter fsfreeze() does.
> > > 
> > > This ability to wait for in-flight write syscalls is used by the prototype to
> > > implement a crash consistent change tracking method [3] without the
> > > need to use the heavy fsfreeze() hammer.
> > 
> > How does this provide any guarantee at all? It doesn't order or
> > wait for physical IOs in any way, so writeback can be active on a
> > file and writing data from both sides of a syscall write "barrier".
> > i.e. there is no coherency between what is on disk, the cmtime of
> > the inode and the write barrier itself.
> > 
> > Freeze is an actual physical write barrier. A very heavy handed
> > physical write barrier, yes, but it has very well defined and
> > bounded physical data persistence semantics.
> 
> Yes. Freeze is a "write barrier to persistent storage".
> This is not what "vfs write barrier" is about.
> I will try to explain better.
> 
> Some syscalls modify the data/metadata of filesystem objects in memory
> (a.k.a "in-core") and some syscalls query in-core data/metadata
> of filesystem objects.
> 
> It is often the case that in-core data/metadata readers are not fully
> synchronized with in-core data/metadata writers, and in-core data and
> metadata are often not modified atomically w.r.t. the in-core
> data/metadata readers.
> Even related metadata attributes are often not modified atomically
> w.r.t. their readers (e.g. statx()).
> 
> When it comes to "observing changes" multigrain ctime/mtime has
> improved things a lot for observing a change in ctime/mtime since
> last sampled and for observing an order of ctime/mtime changes
> on different inodes, but it hasn't changed the fact that ctime/mtime
> changes can be observed *before* the respective metadata/data
> changes can be observed.
> 
> An example problem is that a naive backup or indexing program can
> read old data/metadata with new timestamp T and wrongly conclude
> that it read all changes up to time T.
> 
> It is true that "real" backup programs know that applications and
> the filesystem need to be quiesced before backup, but day to day
> cloud storage sync programs and indexers cannot practically freeze
> the filesystem for their work.
> 

Right. That is still a known problem. For directory operations, the
i_rwsem keeps things consistent, but for regular files, it's possible
to see new timestamps alongside old file contents. That's a
problem since caching algorithms that watch for timestamp changes can
end up not seeing the new contents until the _next_ change occurs,
which might not ever happen.

It would be better to change the file write code to update the
timestamps after copying data to the pagecache. It would still be
possible in that case to see old attributes + new contents, but that's
preferable to the reverse for callers that are watching for changes to
attributes.
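A toy model of why the current update order can strand a timestamp-polling watcher, and why updating after the copy is safe (illustrative code only, not the kernel write path; the structs and `poll_file` are invented for the sketch):

```c
/*
 * Toy model of a sync program that re-reads a file's contents only when
 * it observes a new mtime.
 */
struct file_state { int mtime; int data; };
struct watcher { int seen_mtime; int replica; };

static void poll_file(struct watcher *w, const struct file_state *f)
{
	if (f->mtime != w->seen_mtime) {  /* change detected via timestamp */
		w->seen_mtime = f->mtime;
		w->replica = f->data;     /* re-read the contents */
	}
}
```

With timestamp-before-data, a poll that lands mid-write caches the new mtime with the old contents and never re-reads unless another change comes; with data-before-timestamp, a mid-write poll merely skips, and the later mtime bump triggers a re-read of the new contents.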

Would fixing that help your use-case at all?

> For the HSM prototype, we track changes to a filesystem during
> a given time period by handling pre-modify vfs events and recording
> the file handles of changed objects.
> 
> sb_write_barrier(sb) provides an (internal so far) vfs API to wait
> for in-flight syscalls that may still be modifying user-visible in-core
> data/metadata, without blocking new syscalls.
> 
> The method described in the HSM prototype [3] uses this API
> to persist the state that all the changes until time T were "observed".
> 
> > This proposed write barrier does not seem capable of providing any
> > sort of physical data or metadata/data write ordering guarantees, so
> > I'm a bit lost in how it can be used to provide reliable "crash
> > consistent change tracking" when there is no relationship between
> > the data/metadata in memory and data/metadata on disk...
> 
> That's a good question. A bit hard to explain but I will try.
> 
> The short answer is that the vfs write barrier does *not* by itself
> provide the guarantee for "crash consistent change tracking".
> 
> In the prototype, the "crash consistent change tracking" guarantee
> is provided by the fact that the change records are recorded
> as metadata in the same filesystem, prior to the modification, and
> those metadata records are strictly ordered by the filesystem before
> the actual change.
> 
> The vfs write barrier allows partitioning the change tracking records
> into overlapping time periods in a way that allows the *consumer* of
> the changes to consume the changes in a "crash consistent manner",
> because:
> 
> 1. All the in-core changes recorded before the barrier are fully
>     observable after the barrier
> 2. All the in-core changes that started after the barrier will be recorded
>     for the future change query
> 
> I would love to discuss the merits and pitfalls of this method, but the
> main thing I wanted to get feedback on is whether anyone finds the
> described vfs API useful for anything other than the change tracking
> system that I described.

-- 
Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-23 18:14     ` Jeff Layton
@ 2025-01-24 21:07       ` Amir Goldstein
  2025-02-11 14:53       ` Jan Kara
  1 sibling, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-01-24 21:07 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Dave Chinner, linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner,
	Josef Bacik

On Thu, Jan 23, 2025 at 7:14 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote:
> > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > > > Hi all,
> > > >
> > > > I would like to present the idea of vfs write barriers that was proposed by Jan
> > > > and prototyped for the use of fanotify HSM change tracking events [1].
> > > >
> > > > The historical records state that I had mentioned the idea briefly at the end of
> > > > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> > > > its wider implications at the time.
> > > >
> > > > The vfs write barriers are implemented by taking a per-sb srcu read side
> > > > lock for the scope of {mnt,file}_{want,drop}_write().
> > > >
> > > > This could be used by users - an HSM service, in the case of the prototype -
> > > > to wait for all in-flight write syscalls, without blocking new write syscalls
> > > > as the stricter fsfreeze() does.
> > > >
> > > > This ability to wait for in-flight write syscalls is used by the prototype to
> > > > implement a crash consistent change tracking method [3] without the
> > > > need to use the heavy fsfreeze() hammer.
> > >
> > > How does this provide any guarantee at all? It doesn't order or
> > > wait for physical IOs in any way, so writeback can be active on a
> > > file and writing data from both sides of a syscall write "barrier".
> > > i.e. there is no coherency between what is on disk, the cmtime of
> > > the inode and the write barrier itself.
> > >
> > > Freeze is an actual physical write barrier. A very heavy handed
> > > physical write barrier, yes, but it has very well defined and
> > > bounded physical data persistence semantics.
> >
> > Yes. Freeze is a "write barrier to persistent storage".
> > This is not what "vfs write barrier" is about.
> > I will try to explain better.
> >
> > Some syscalls modify the data/metadata of filesystem objects in memory
> > (a.k.a "in-core") and some syscalls query in-core data/metadata
> > of filesystem objects.
> >
> > It is often the case that in-core data/metadata readers are not fully
> > synchronized with in-core data/metadata writers, and in-core data and
> > metadata are often not modified atomically w.r.t. the in-core
> > data/metadata readers.
> > Even related metadata attributes are often not modified atomically
> > w.r.t. their readers (e.g. statx()).
> >
> > When it comes to "observing changes" multigrain ctime/mtime has
> > improved things a lot for observing a change in ctime/mtime since
> > last sampled and for observing an order of ctime/mtime changes
> > on different inodes, but it hasn't changed the fact that ctime/mtime
> > changes can be observed *before* the respective metadata/data
> > changes can be observed.
> >
> > An example problem is that a naive backup or indexing program can
> > read old data/metadata with new timestamp T and wrongly conclude
> > that it read all changes up to time T.
> >
> > It is true that "real" backup programs know that applications and
> > the filesystem need to be quiesced before backup, but day to day
> > cloud storage sync programs and indexers cannot practically freeze
> > the filesystem for their work.
> >
>
> Right. That is still a known problem. For directory operations, the
> i_rwsem keeps things consistent, but for regular files, it's possible
> to see new timestamps alongside old file contents. That's a
> problem since caching algorithms that watch for timestamp changes can
> end up not seeing the new contents until the _next_ change occurs,
> which might not ever happen.
>
> It would be better to change the file write code to update the
> timestamps after copying data to the pagecache. It would still be
> possible in that case to see old attributes + new contents, but that's
> preferable to the reverse for callers that are watching for changes to
> attributes.
>

Yes, I remember this was discussed.
I think it may make sense to update the timestamps both before and after
copying data to the page cache?

> Would fixing that help your use-case at all?
>

I don't think it would, because my use case is not about querying
the change status of a single inode. If the post-change timestamp
update helps there, I don't see how.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-20 11:41   ` Amir Goldstein
  2025-01-23  0:34     ` Dave Chinner
  2025-01-23 18:14     ` Jeff Layton
@ 2025-01-27 23:34     ` Dave Chinner
  2025-01-29  1:39       ` Amir Goldstein
  2 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2025-01-27 23:34 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik,
	Jeff Layton

On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > This proposed write barrier does not seem capable of providing any
> > sort of physical data or metadata/data write ordering guarantees, so
> > I'm a bit lost in how it can be used to provide reliable "crash
> > consistent change tracking" when there is no relationship between
> > the data/metadata in memory and data/metadata on disk...
> 
> That's a good question. A bit hard to explain but I will try.
> 
> The short answer is that the vfs write barrier does *not* by itself
> provide the guarantee for "crash consistent change tracking".
> 
> In the prototype, the "crash consistent change tracking" guarantee
> is provided by the fact that the change records are recorded as
> as metadata in the same filesystem, prior to the modification and
> those metadata records are strictly ordered by the filesystem before
> the actual change.

Uh, ok.

I've read the docco and I think I understand what the prototype
you've pointed me at is doing.

It is using a separate chunk of the filesystem as a database to
persist change records for data in the filesystem. It is doing this
by creating an empty(?) file per change record in a per time
period (T) directory instance.

i.e.

write()
 -> pre-modify
  -> fanotify
   -> userspace HSM
    -> create file in dir T named "<filehandle-other-stuff>"

And then you're relying on the filesystem to make that directory
entry T/<filehandle-other-stuff> stable before the data the
pre-modify record was generated for ever gets written.

IOWs, you're specifically relying on *all unrelated metadata changes
in the filesystem* having strict global ordering *and* being
persisted before any data written after the metadata was created
is persisted.

Sure, this might work right now on XFS because the journalling
implementation -currently- provides global metadata ordering and
data/metadata ordering based on IO completion to submission
ordering.

However, we do not guarantee that XFS will -always- have this
behaviour. This is an *implementation detail*, not a guaranteed
behaviour we will preserve for all time. i.e. we reserve the right
to change how we do unrelated metadata and data/metadata ordering
internally.

This reminds me of how applications observed that ext3 ordered mode
didn't require fsync to guarantee the data got written before the
metadata, so they elided the fsync() because it was really expensive
on ext3. i.e. they started relying on a specific filesystem
implementation detail for "correct crash consistency behaviour",
without understanding that it -only worked on ext3- and broke crash
consistency behaviour on all other filesystems. That was *bad*, and
it took a long time to get the message across that applications
*must* use fsync() for correct crash consistency behaviour...

What you are describing for your prototype HSM to provide crash
consistent change tracking really seems to me like it is reliant
on the side effects of specific filesystem implementation choices,
not a behaviour that all filesystems guarantee.

i.e. not all filesystems provide strict global metadata ordering
semantics, and some fs maintainers are on record explicitly stating
that they will not provide or guarantee them. e.g. ext4, especially
with fast commits enabled, will not provide global strictly ordered
metadata semantics. btrfs also doesn't provide such a guarantee,
either.

> I would love to discuss the merits and pitfalls of this method, but the
> main thing I wanted to get feedback on is whether anyone finds the
> described vfs API useful for anything other that the change tracking
> system that I described.

If my understanding is correct, then this HSM prototype change
tracking mechanism seems like a fragile, unsupportable architecture.
I don't think we should be trying to add new VFS infrastructure to
make it work, because I think the underlying behaviours it requires
from filesystems are simply not guaranteed to exist.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-27 23:34     ` Dave Chinner
@ 2025-01-29  1:39       ` Amir Goldstein
  2025-02-11 21:12         ` Dave Chinner
  0 siblings, 1 reply; 14+ messages in thread
From: Amir Goldstein @ 2025-01-29  1:39 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik,
	Jeff Layton

On Tue, Jan 28, 2025 at 12:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > > This proposed write barrier does not seem capable of providing any
> > > sort of physical data or metadata/data write ordering guarantees, so
> > > I'm a bit lost in how it can be used to provide reliable "crash
> > > consistent change tracking" when there is no relationship between
> > > the data/metadata in memory and data/metadata on disk...
> >
> > That's a good question. A bit hard to explain but I will try.
> >
> > The short answer is that the vfs write barrier does *not* by itself
> > provide the guarantee for "crash consistent change tracking".
> >
> > In the prototype, the "crash consistent change tracking" guarantee
> > is provided by the fact that the change records are recorded as
> > as metadata in the same filesystem, prior to the modification and
> > those metadata records are strictly ordered by the filesystem before
> > the actual change.
>
> Uh, ok.
>
> I've read the docco and I think I understand what the prototype
> you've pointed me at is doing.
>
> It is using a separate chunk of the filesystem as a database to
> persist change records for data in the filesystem. It is doing this
> by creating an empty(?) file per change record in a per time
> period (T) directory instance.
>
> i.e.
>
> write()
>  -> pre-modify
>   -> fanotify
>    -> userspace HSM
>     -> create file in dir T named "<filehandle-other-stuff>"
>
> And then you're relying on the filesystem to make that directory
> entry T/<filehandle-other-stuff> stable before the data the
> pre-modify record was generated for ever gets written.
>

Yes.

> IOWs, you've specifically relying on *all unrelated metadata changes
> in the filesystem* having strict global ordering *and* being
> persisted before any data written after the metadata was created
> is persisted.
>
> Sure, this might work right now on XFS because the journalling
> implementation -currently- provides global metadata ordering and
> data/metadata ordering based on IO completion to submission
> ordering.
>

Yes.

> However, we do not guarantee that XFS will -always- have this
> behaviour. This is an *implementation detail*, not a guaranteed
> behaviour we will preserve for all time. i.e. we reserve the right
> to change how we do unrelated metadata and data/metadata ordering
> internally.
>

Yes, that's why it's a prototype, but it's a userspace prototype.
The requirements from the kernel API would not change if the userspace
server used independent NVRAM to store the change records.

> This reminds of how applications observed that ext3 ordered mode
> didn't require fsync to guarantee the data got written before the
> metadata, so they elided the fsync() because it was really expensive
> on ext3. i.e. they started relying on a specific filesystem
> implementation detail for "correct crash consistency behaviour",
> without understanding that it -only worked on ext3- and broken crash
> consistency behaviour on all other filesystems. That was *bad*, and
> it took a long time to get the message across that applications
> *must* use fsync() for correct crash consistency behaviour...

I am familiar with that episode.

>
> What you are describing for your prototype HSM to provide crash
> consistent change tracking really seems to me like it is reliant
> on the side effects of specific filesystem implementation choices,
> not a behaviour that all filesysetms guarantee.
>
> i.e. not all filesystems provide strict global metadata ordering
> semantics, and some fs maintainers are on record explicitly stating
> that they will not provide or guarantee them. e.g. ext4, especially
> with fast commits enabled, will not provide global strictly ordered
> metadata semantics. btrfs also doesn't provide such a guarantee,
> either.
>

Right. We once proposed formalizing this contract [1],
but it's a bit off topic.

> > I would love to discuss the merits and pitfalls of this method, but the
> > main thing I wanted to get feedback on is whether anyone finds the
> > described vfs API useful for anything other that the change tracking
> > system that I described.
>
> If my understanding is correct, then this HSM prototype change
> tracking mechanism seems like a fragile, unsupportable architecture.
> I don't think we should be trying to add new VFS infrastructure to
> make it work, because I think the underlying behaviours it requires
> from filesystems are simply not guaranteed to exist.
>

That's a valid opinion.

Do you have an idea for a better design for fs-agnostic change tracking?

I mean, sure, we can re-implement DMAPI in specific fs, but I don't think
anyone would like that.

IMO the metadata ordering contract is a technical matter that could be fixed.

I still hold the opinion that the ordering of in-core changes w.r.t. readers
is a problem regardless of persistence to disk, but I may need to come up
with more compelling use cases to demonstrate this problem.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@mail.gmail.com/

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-23 18:14     ` Jeff Layton
  2025-01-24 21:07       ` Amir Goldstein
@ 2025-02-11 14:53       ` Jan Kara
  2025-03-20 17:00         ` Amir Goldstein
  1 sibling, 1 reply; 14+ messages in thread
From: Jan Kara @ 2025-02-11 14:53 UTC (permalink / raw)
  To: Jeff Layton
  Cc: Amir Goldstein, Dave Chinner, linux-fsdevel, lsf-pc, Jan Kara,
	Christian Brauner, Josef Bacik

On Thu 23-01-25 13:14:11, Jeff Layton wrote:
> On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote:
> > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > > 
> > > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > > > Hi all,
> > > > 
> > > > I would like to present the idea of vfs write barriers that was proposed by Jan
> > > > and prototyped for the use of fanotify HSM change tracking events [1].
> > > > 
> > > > The historical records state that I had mentioned the idea briefly at the end of
> > > > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> > > > its wider implications at the time.
> > > > 
> > > > The vfs write barriers are implemented by taking a per-sb srcu read side
> > > > lock for the scope of {mnt,file}_{want,drop}_write().
> > > > 
> > > > This could be used by users - in the case of the prototype - an HSM service -
> > > > to wait for all in-flight write syscalls, without blocking new write syscalls
> > > > as the stricter fsfreeze() does.
> > > > 
> > > > This ability to wait for in-flight write syscalls is used by the prototype to
> > > > implement a crash consistent change tracking method [3] without the
> > > > need to use the heavy fsfreeze() hammer.
> > > 
> > > How does this provide anything guarantee at all? It doesn't order or
> > > wait for physical IOs in any way, so writeback can be active on a
> > > file and writing data from both sides of a syscall write "barrier".
> > > i.e. there is no coherency between what is on disk, the cmtime of
> > > the inode and the write barrier itself.
> > > 
> > > Freeze is an actual physical write barrier. A very heavy handed
> > > physical right barrier, yes, but it has very well defined and
> > > bounded physical data persistence semantics.
> > 
> > Yes. Freeze is a "write barrier to persistence storage".
> > This is not what "vfs write barrier" is about.
> > I will try to explain better.
> > 
> > Some syscalls modify the data/metadata of filesystem objects in memory
> > (a.k.a "in-core") and some syscalls query in-core data/metadata
> > of filesystem objects.
> > 
> > It is often the case that in-core data/metadata readers are not fully
> > synchronized with in-core data/metadata writers and it is often that
> > in-core data and metadata are not modified atomically w.r.t the
> > in-core data/metadata readers.
> > Even related metadata attributes are often not modified atomically
> > w.r.t to their readers (e.g. statx()).
> > 
> > When it comes to "observing changes" multigrain ctime/mtime has
> > improved things a lot for observing a change in ctime/mtime since
> > last sampled and for observing an order of ctime/mtime changes
> > on different inodes, but it hasn't changed the fact that ctime/mtime
> > changes can be observed *before* the respective metadata/data
> > changes can be observed.
> > 
> > An example problem is that a naive backup or indexing program can
> > read old data/metadata with new timestamp T and wrongly conclude
> > that it read all changes up to time T.
> > 
> > It is true that "real" backup programs know that applications and
> > filesystem needs to be quisences before backup, but actual
> > day to day cloud storage sync programs and indexers cannot
> > practically freeze the filesystem for their work.
> > 
> 
> Right. That is still a known problem. For directory operations, the
> i_rwsem keeps things consistent, but for regular files, it's possible
> to see new timestamps alongside with old file contents. That's a
> problem since caching algorithms that watch for timestamp changes can
> end up not seeing the new contents until the _next_ change occurs,
> which might not ever happen.
> 
> It would be better to change the file write code to update the
> timestamps after copying data to the pagecache. It would still be
> possible in that case to see old attributes + new contents, but that's
> preferable to the reverse for callers that are watching for changes to
> attributes.
> 
> Would fixing that help your use-case at all?

I think Amir wanted to make a point here in the other direction: I.e., if
the application did:
 * sample inode timestamp
 * vfs_write_barrier()
 * read file data

then it is *guaranteed* it will never see old data & new timestamp, and hence
the caching problem is solved. No need to update the timestamp after the write.

Now I agree updating timestamps after write is much nicer from a usability
POV (given how common the pattern above is) but this is just a simple example
demonstrating possible uses for vfs_write_barrier().
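As a concrete illustration of that three-step pattern, here is a sketch
(Python, purely hypothetical - vfs_write_barrier() is not exposed to
userspace today, so the write_barrier argument below is a stand-in):

```python
import os

def read_consistent(path, write_barrier):
    """Sample timestamp, wait out in-flight writes, then read data.

    This never returns old data paired with a newer timestamp: any write
    that bumped the sampled mtime has finished its in-core data copy by
    the time write_barrier() returns.
    """
    st = os.stat(path)        # 1. sample inode timestamp
    write_barrier()           # 2. hypothetical vfs_write_barrier()
    with open(path, "rb") as f:
        data = f.read()       # 3. read file data
    return st.st_mtime_ns, data
```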

								Honza

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-29  1:39       ` Amir Goldstein
@ 2025-02-11 21:12         ` Dave Chinner
  2025-02-12  8:29           ` Amir Goldstein
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2025-02-11 21:12 UTC (permalink / raw)
  To: Amir Goldstein
  Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik,
	Jeff Layton

On Wed, Jan 29, 2025 at 02:39:56AM +0100, Amir Goldstein wrote:
> On Tue, Jan 28, 2025 at 12:34 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> > > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > This proposed write barrier does not seem capable of providing any
> > > > sort of physical data or metadata/data write ordering guarantees, so
> > > > I'm a bit lost in how it can be used to provide reliable "crash
> > > > consistent change tracking" when there is no relationship between
> > > > the data/metadata in memory and data/metadata on disk...
> > >
> > > That's a good question. A bit hard to explain but I will try.
> > >
> > > The short answer is that the vfs write barrier does *not* by itself
> > > provide the guarantee for "crash consistent change tracking".
> > >
> > > In the prototype, the "crash consistent change tracking" guarantee
> > > is provided by the fact that the change records are recorded as
> > > as metadata in the same filesystem, prior to the modification and
> > > those metadata records are strictly ordered by the filesystem before
> > > the actual change.
> >
> > Uh, ok.
> >
> > I've read the docco and I think I understand what the prototype
> > you've pointed me at is doing.
> >
> > It is using a separate chunk of the filesystem as a database to
> > persist change records for data in the filesystem. It is doing this
> > by creating an empty(?) file per change record in a per time
> > period (T) directory instance.
> >
> > i.e.
> >
> > write()
> >  -> pre-modify
> >   -> fanotify
> >    -> userspace HSM
> >     -> create file in dir T named "<filehandle-other-stuff>"
> >
> > And then you're relying on the filesystem to make that directory
> > entry T/<filehandle-other-stuff> stable before the data the
> > pre-modify record was generated for ever gets written.
> >
> 
> Yes.
> 
> > IOWs, you've specifically relying on *all unrelated metadata changes
> > in the filesystem* having strict global ordering *and* being
> > persisted before any data written after the metadata was created
> > is persisted.
> >
> > Sure, this might work right now on XFS because the journalling
> > implementation -currently- provides global metadata ordering and
> > data/metadata ordering based on IO completion to submission
> > ordering.
> >
> 
> Yes.

[....]

> > > I would love to discuss the merits and pitfalls of this method, but the
> > > main thing I wanted to get feedback on is whether anyone finds the
> > > described vfs API useful for anything other that the change tracking
> > > system that I described.
> >
> > If my understanding is correct, then this HSM prototype change
> > tracking mechanism seems like a fragile, unsupportable architecture.
> > I don't think we should be trying to add new VFS infrastructure to
> > make it work, because I think the underlying behaviours it requires
> > from filesystems are simply not guaranteed to exist.
> >
> 
> That's a valid opinion.
> 
> Do you have an idea for a better design for fs agnostic change tracking?

Store your HSM metadata in a database on a different storage device
and only signal the pre-modification notification as complete once
the database has completed its update transaction.

> I mean, sure, we can re-implement DMAPI in specific fs, but I don't think
> anyone would like that.

DMAPI pre-modification notifications didn't rely on side effects of
filesystem behaviour for correctness. The HSM had to guarantee that
its recording of events was stable before it allowed the
modification to be done. Lots of dmapi modification notifications
used pre- and post- event notifications so the HSM could keep track
of modifications that were in flight at any given point in time.

That way the HSM recovery process knew after a crash which files it
needed to go look at to determine if the operation in progress had
completed or not once the system came back up....

> IMO The metadata ordering contract is a technical matter that could be fixed.
> 
> I still hold the opinion that the in-core changes order w.r.t readers
> is a problem
> regardless of persistence to disk, but I may need to come up with more
> compelling
> use cases to demonstrate this problem.

IIRC, the XFS DMAPI implementation solved that problem by blocking
read notifications whilst there was a pending modification
notification outstanding. The problem with the Linux DMAPI
implementation of this (one of the show stoppers that prevented
merge) was that it held a rwsem across syscall contexts to provide
this functionality.....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-02-11 21:12         ` Dave Chinner
@ 2025-02-12  8:29           ` Amir Goldstein
  0 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-02-12  8:29 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik,
	Jeff Layton

On Tue, Feb 11, 2025 at 10:12 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Jan 29, 2025 at 02:39:56AM +0100, Amir Goldstein wrote:
> > On Tue, Jan 28, 2025 at 12:34 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> > > > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > This proposed write barrier does not seem capable of providing any
> > > > > sort of physical data or metadata/data write ordering guarantees, so
> > > > > I'm a bit lost in how it can be used to provide reliable "crash
> > > > > consistent change tracking" when there is no relationship between
> > > > > the data/metadata in memory and data/metadata on disk...
> > > >
> > > > That's a good question. A bit hard to explain but I will try.
> > > >
> > > > The short answer is that the vfs write barrier does *not* by itself
> > > > provide the guarantee for "crash consistent change tracking".
> > > >
> > > > In the prototype, the "crash consistent change tracking" guarantee
> > > > is provided by the fact that the change records are recorded as
> > > > as metadata in the same filesystem, prior to the modification and
> > > > those metadata records are strictly ordered by the filesystem before
> > > > the actual change.
> > >
> > > Uh, ok.
> > >
> > > I've read the docco and I think I understand what the prototype
> > > you've pointed me at is doing.
> > >
> > > It is using a separate chunk of the filesystem as a database to
> > > persist change records for data in the filesystem. It is doing this
> > > by creating an empty(?) file per change record in a per time
> > > period (T) directory instance.
> > >
> > > i.e.
> > >
> > > write()
> > >  -> pre-modify
> > >   -> fanotify
> > >    -> userspace HSM
> > >     -> create file in dir T named "<filehandle-other-stuff>"
> > >
> > > And then you're relying on the filesystem to make that directory
> > > entry T/<filehandle-other-stuff> stable before the data the
> > > pre-modify record was generated for ever gets written.
> > >
> >
> > Yes.
> >
> > > IOWs, you've specifically relying on *all unrelated metadata changes
> > > in the filesystem* having strict global ordering *and* being
> > > persisted before any data written after the metadata was created
> > > is persisted.
> > >
> > > Sure, this might work right now on XFS because the journalling
> > > implementation -currently- provides global metadata ordering and
> > > data/metadata ordering based on IO completion to submission
> > > ordering.
> > >
> >
> > Yes.
>
> [....]
>
> > > > I would love to discuss the merits and pitfalls of this method, but the
> > > > main thing I wanted to get feedback on is whether anyone finds the
> > > > described vfs API useful for anything other that the change tracking
> > > > system that I described.
> > >
> > > If my understanding is correct, then this HSM prototype change
> > > tracking mechanism seems like a fragile, unsupportable architecture.
> > > I don't think we should be trying to add new VFS infrastructure to
> > > make it work, because I think the underlying behaviours it requires
> > > from filesystems are simply not guaranteed to exist.
> > >
> >
> > That's a valid opinion.
> >
> > Do you have an idea for a better design for fs agnostic change tracking?
>
> Store your HSM metadata in a database on a different storage device
> and only signal the pre-modification notification as complete once
> the database has completed it's update transaction.
>

Yes, naturally.
This was exactly my point: on-disk persistence is completely orthogonal
to the purpose for which the sb_write_barrier() API is being proposed.

> > I mean, sure, we can re-implement DMAPI in specific fs, but I don't think
> > anyone would like that.
>
> DMAPI pre-modification notifications didn't rely on side effects of
> filesystem behaviour for correctness.

Neither does fanotify.
My HSM prototype relies on some XFS side effects, but a production HSM
using the same fanotify API could store changes in a db on another fs
or in persistent memory.

> The HSM had to guarantee that
> it's recording of events were stable before it allowed the
> modification to be done.

No change in methodology here.

> Lots of dmapi modification notifications
> used pre- and post- event notifications so the HSM could keep track
> of modifications that were in flight at any given point in time.
>

OK, now we are talking about the relevant point.
Persistently recording an intent to change on the pre- event is fine.
Notifying the application that the change has been done on the pre- event
is racy, because the application may wrongly believe that it has already
consumed the notified/recorded change.

Complementing every single pre- event with a matching post- event is one
possible solution, and I think Jan and I discussed it as well.
sb_write_barrier() is a much easier API for an HSM, because an HSM
rarely needs to consume a single change; it is much more likely
to consume a large batch of changes, so the sb_write_barrier() API
is a much more efficient way of getting the same guarantee that
"all the changes recorded with pre- events are observable".

> That way the HSM recovery process knew after a crash which files it
> needed to go look at to determine if the operation in progress had
> completed or not once the system came back up....
>

Yes, exactly what we need and what sb_write_barrier() helps to achieve.

> > IMO The metadata ordering contract is a technical matter that could be fixed.
> >
> > I still hold the opinion that the in-core changes order w.r.t readers
> > is a problem
> > regardless of persistence to disk, but I may need to come up with more
> > compelling
> > use cases to demonstrate this problem.
>
> IIRC, the XFS DMAPI implementation solved that problem by blocking
> read notifications whilst there was a pending modification
> notification outstanding. The problem with the Linux DMAPI
> implementation of this (one of the show stoppers that prevented
> merge) was that it held a rwsem across syscall contexts to provide
> this functionality.....
>

sb_write_barrier() allows an HSM to achieve the same end result without
holding a rwsem across syscall contexts.
It's literally SRCU instead of the DMAPI rwsem. No more, no less:

sb_start_write_srcu() --> notify change intent --> HSM record to changes db
               <-- ack change intent recorded <--
...
make in-core changes
...
               <-- wait for changes in-flight <-- sb_write_barrier()
sb_end_write_srcu() --> ack changes in-flight -->
                 <-- persist recorded changes <-- syncfs()
persist in-core changes
                      --> ack persist changes --> HSM notify change consumers
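To make the SRCU flavour of the diagram concrete, here is a minimal
userspace sketch (Python, purely illustrative - the kernel side uses real
SRCU, and all names here are made up) of the barrier semantics: writers
never block on entry, and the barrier waits only for writers that entered
before it was issued:

```python
import threading

class WriteBarrier:
    """Toy two-epoch grace-period barrier (SRCU-like semantics).

    Assumes one barrier caller at a time; a real implementation must
    also serialize concurrent grace periods (omitted here).
    """

    def __init__(self):
        self._cond = threading.Condition()
        self._epoch = 0
        self._inflight = [0, 0]   # in-flight writers, per epoch

    def start_write(self):
        # Analogue of sb_start_write_srcu(): never blocks new writers.
        with self._cond:
            self._inflight[self._epoch] += 1
            return self._epoch

    def end_write(self, epoch):
        # Analogue of sb_end_write_srcu().
        with self._cond:
            self._inflight[epoch] -= 1
            if not self._inflight[epoch]:
                self._cond.notify_all()

    def barrier(self):
        # Analogue of sb_write_barrier(): flip the epoch so new writers
        # land in the other bucket, then wait out pre-existing writers.
        with self._cond:
            old, self._epoch = self._epoch, self._epoch ^ 1
            while self._inflight[old]:
                self._cond.wait()
```

An HSM thread would record pending change intents, call barrier(), and
then know that every write which raced with the recording has either
completed its in-core change or started after the record was made.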

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-02-11 14:53       ` Jan Kara
@ 2025-03-20 17:00         ` Amir Goldstein
  2025-03-27 18:23           ` Amir Goldstein
  0 siblings, 1 reply; 14+ messages in thread
From: Amir Goldstein @ 2025-03-20 17:00 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Layton, Dave Chinner, linux-fsdevel, lsf-pc,
	Christian Brauner, Josef Bacik

On Tue, Feb 11, 2025 at 5:22 PM Jan Kara <jack@suse.cz> wrote:
>
> On Thu 23-01-25 13:14:11, Jeff Layton wrote:
> > On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote:
> > > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > > >
> > > > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > > > > Hi all,
> > > > >
> > > > > I would like to present the idea of vfs write barriers that was proposed by Jan
> > > > > and prototyped for the use of fanotify HSM change tracking events [1].
> > > > >
> > > > > The historical records state that I had mentioned the idea briefly at the end of
> > > > > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> > > > > its wider implications at the time.
> > > > >
> > > > > The vfs write barriers are implemented by taking a per-sb srcu read side
> > > > > lock for the scope of {mnt,file}_{want,drop}_write().
> > > > >
> > > > > This could be used by users - in the case of the prototype - an HSM service -
> > > > > to wait for all in-flight write syscalls, without blocking new write syscalls
> > > > > as the stricter fsfreeze() does.
> > > > >
> > > > > This ability to wait for in-flight write syscalls is used by the prototype to
> > > > > implement a crash consistent change tracking method [3] without the
> > > > > need to use the heavy fsfreeze() hammer.
> > > >
> > > > How does this provide anything guarantee at all? It doesn't order or
> > > > wait for physical IOs in any way, so writeback can be active on a
> > > > file and writing data from both sides of a syscall write "barrier".
> > > > i.e. there is no coherency between what is on disk, the cmtime of
> > > > the inode and the write barrier itself.
> > > >
> > > > Freeze is an actual physical write barrier. A very heavy handed
> > > > physical right barrier, yes, but it has very well defined and
> > > > bounded physical data persistence semantics.
> > >
> > > Yes. Freeze is a "write barrier to persistence storage".
> > > This is not what "vfs write barrier" is about.
> > > I will try to explain better.
> > >
> > > Some syscalls modify the data/metadata of filesystem objects in memory
> > > (a.k.a "in-core") and some syscalls query in-core data/metadata
> > > of filesystem objects.
> > >
> > > It is often the case that in-core data/metadata readers are not fully
> > > synchronized with in-core data/metadata writers and it is often that
> > > in-core data and metadata are not modified atomically w.r.t the
> > > in-core data/metadata readers.
> > > Even related metadata attributes are often not modified atomically
> > > w.r.t to their readers (e.g. statx()).
> > >
> > > When it comes to "observing changes" multigrain ctime/mtime has
> > > improved things a lot for observing a change in ctime/mtime since
> > > last sampled and for observing an order of ctime/mtime changes
> > > on different inodes, but it hasn't changed the fact that ctime/mtime
> > > changes can be observed *before* the respective metadata/data
> > > changes can be observed.
> > >
> > > An example problem is that a naive backup or indexing program can
> > > read old data/metadata with new timestamp T and wrongly conclude
> > > that it read all changes up to time T.
> > >
> > > It is true that "real" backup programs know that applications and
> > > filesystem needs to be quisences before backup, but actual
> > > day to day cloud storage sync programs and indexers cannot
> > > practically freeze the filesystem for their work.
> > >
> >
> > Right. That is still a known problem. For directory operations, the
> > i_rwsem keeps things consistent, but for regular files, it's possible
> > to see new timestamps alongside with old file contents. That's a
> > problem since caching algorithms that watch for timestamp changes can
> > end up not seeing the new contents until the _next_ change occurs,
> > which might not ever happen.
> >
> > It would be better to change the file write code to update the
> > timestamps after copying data to the pagecache. It would still be
> > possible in that case to see old attributes + new contents, but that's
> > preferable to the reverse for callers that are watching for changes to
> > attributes.
> >
> > Would fixing that help your use-case at all?
>
> I think Amir wanted to make here a point in the other direction: I.e., if
> the application did:
>  * sample inode timestamp
>  * vfs_write_barrier()
>  * read file data
>
> then it is *guaranteed* it will never see old data & new timestamp and hence
> the caching problem is solved. No need to update timestamp after the write.
>
> > Now I agree updating timestamps after write is much nicer from a usability
> > POV (given how common the pattern above is), but this is just a simple
> > example demonstrating possible uses for vfs_write_barrier().
>

I was trying to figure out if updating the timestamp after the write would
be enough to deal with file writes, and I think that it is not enough when
adding signalling (events) into the picture.
In this case, the consumer is expected to act on changes (e.g. index/backup)
soon after they happen.
I think this case is different from NFS cache which only cares about cache
invalidation on file access(?).

In any case, we need a FAN_PRE_MODIFY blocking event to store a
persistent change intent record before the write - that is needed to find
changes after a crash.

Now unless we want to start polling ctime (and we do not want that),
we need a signal to wake the consumer after the write to the page cache.

One way is to rely on the FAN_MODIFY async event post write.
But there is ambiguity in the existing FAN_MODIFY events:

    Thread A starts write on file F (no listener for FAN_PRE_MODIFY)
Event consumer starts
        Thread B starts write on file F
        FAN_PRE_MODIFY(F) reported from thread B
    Thread A completes write on file F
    FAN_MODIFY(F) reported from thread A (or from aio completion thread)
Event consumer believes it got the last event and can read the final
version of F

So if we use this method, we will need a unique cookie to
associate the POST_MODIFY with the PRE_MODIFY event.

Something like this:

writer                                [fsnotifyd]
-------                                -------------
file_start_write_usn() => FAN_PRE_MODIFY[ fsid, usn, fhandle ]
{                                 <= Record change intent before response
…do some in-core changes
   (e.g. data + mode + ctime)...
} file_end_write_usn() => FAN_POST_MODIFY[ fsid, usn, fhandle ]
                                         Consume changes after FAN_POST_MODIFY

While this is a viable option, it adds yet more hooks and more
events and it does not provide an easy way for consumers to
wait for the completion of a batch of modifications.

The vfs_write_barrier method provides a better way to wait for completion:

writer                                [fsnotifyd]
-------                                -------------
file_start_write_srcu() => FAN_PRE_MODIFY[ fsid, usn, fhandle ]
{                                  <= Record change intent before response
…do some in-core changes under srcu read lock
   (e.g. data + mode + ctime)...
} file_end_write_srcu()
     synchronize_srcu()   <= vfs_write_barrier();
                    Consume a batch of recorded changes after write barrier
                    act on the changes and clear the change intent records

I am hoping to be able to argue for the case of vfs_write_barrier()
in LSFMM, but if this will not be acceptable, I can work with the
post modify events solution.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-03-20 17:00         ` Amir Goldstein
@ 2025-03-27 18:23           ` Amir Goldstein
  0 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-03-27 18:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Jeff Layton, Dave Chinner, linux-fsdevel, lsf-pc,
	Christian Brauner, Josef Bacik

On Thu, Mar 20, 2025 at 6:00 PM Amir Goldstein <amir73il@gmail.com> wrote:
>
> On Tue, Feb 11, 2025 at 5:22 PM Jan Kara <jack@suse.cz> wrote:
> >
> > On Thu 23-01-25 13:14:11, Jeff Layton wrote:
> > > On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote:
> > > > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > >
> > > > > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > > > > > Hi all,
> > > > > >
> > > > > > I would like to present the idea of vfs write barriers that was proposed by Jan
> > > > > > and prototyped for the use of fanotify HSM change tracking events [1].
> > > > > >
> > > > > > The historical records state that I had mentioned the idea briefly at the end of
> > > > > > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> > > > > > its wider implications at the time.
> > > > > >
> > > > > > The vfs write barriers are implemented by taking a per-sb srcu read side
> > > > > > lock for the scope of {mnt,file}_{want,drop}_write().
> > > > > >
> > > > > > This could be used by users - in the case of the prototype - an HSM service -
> > > > > > to wait for all in-flight write syscalls, without blocking new write syscalls
> > > > > > as the stricter fsfreeze() does.
> > > > > >
> > > > > > This ability to wait for in-flight write syscalls is used by the prototype to
> > > > > > implement a crash consistent change tracking method [3] without the
> > > > > > need to use the heavy fsfreeze() hammer.
> > > > >
> > > > > How does this provide any guarantee at all? It doesn't order or
> > > > > wait for physical IOs in any way, so writeback can be active on a
> > > > > file and writing data from both sides of a syscall write "barrier".
> > > > > i.e. there is no coherency between what is on disk, the cmtime of
> > > > > the inode and the write barrier itself.
> > > > >
> > > > > Freeze is an actual physical write barrier. A very heavy handed
> > > > > physical write barrier, yes, but it has very well defined and
> > > > > bounded physical data persistence semantics.
> > > >
> > > > Yes. Freeze is a "write barrier to persistence storage".
> > > > This is not what "vfs write barrier" is about.
> > > > I will try to explain better.
> > > >
> > > > Some syscalls modify the data/metadata of filesystem objects in memory
> > > > (a.k.a "in-core") and some syscalls query in-core data/metadata
> > > > of filesystem objects.
> > > >
> > > > It is often the case that in-core data/metadata readers are not fully
> > > > synchronized with in-core data/metadata writers and it is often that
> > > > in-core data and metadata are not modified atomically w.r.t the
> > > > in-core data/metadata readers.
> > > > Even related metadata attributes are often not modified atomically
> > > > w.r.t. their readers (e.g. statx()).
> > > >
> > > > When it comes to "observing changes" multigrain ctime/mtime has
> > > > improved things a lot for observing a change in ctime/mtime since
> > > > last sampled and for observing an order of ctime/mtime changes
> > > > on different inodes, but it hasn't changed the fact that ctime/mtime
> > > > changes can be observed *before* the respective metadata/data
> > > > changes can be observed.
> > > >
> > > > An example problem is that a naive backup or indexing program can
> > > > read old data/metadata with new timestamp T and wrongly conclude
> > > > that it read all changes up to time T.
> > > >
> > > > It is true that "real" backup programs know that applications and
> > > > filesystems need to be quiesced before backup, but day-to-day
> > > > cloud storage sync programs and indexers cannot practically
> > > > freeze the filesystem for their work.
> > > >
> > >
> > > Right. That is still a known problem. For directory operations, the
> > > i_rwsem keeps things consistent, but for regular files, it's possible
> > > to see new timestamps alongside with old file contents. That's a
> > > problem since caching algorithms that watch for timestamp changes can
> > > end up not seeing the new contents until the _next_ change occurs,
> > > which might not ever happen.
> > >
> > > It would be better to change the file write code to update the
> > > timestamps after copying data to the pagecache. It would still be
> > > possible in that case to see old attributes + new contents, but that's
> > > preferable to the reverse for callers that are watching for changes to
> > > attributes.
> > >
> > > Would fixing that help your use-case at all?
> >
> > I think Amir wanted to make here a point in the other direction: I.e., if
> > the application did:
> >  * sample inode timestamp
> >  * vfs_write_barrier()
> >  * read file data
> >
> > then it is *guaranteed* it will never see old data & new timestamp and hence
> > the caching problem is solved. No need to update timestamp after the write.
> >
> > Now I agree updating timestamps after write is much nicer from a usability
> > POV (given how common the pattern above is), but this is just a simple
> > example demonstrating possible uses for vfs_write_barrier().
> >
>
> I was trying to figure out if updating the timestamp after the write would
> be enough to deal with file writes, and I think that it is not enough when
> adding signalling (events) into the picture.
> In this case, the consumer is expected to act on changes (e.g. index/backup)
> soon after they happen.
> I think this case is different from NFS cache which only cares about cache
> invalidation on file access(?).
>
> In any case, we need a FAN_PRE_MODIFY blocking event to store a
> persistent change intent record before the write - that is needed to find
> changes after a crash.
>
> Now unless we want to start polling ctime (and we do not want that),
> we need a signal to wake the consumer after the write to the page cache.
>
> One way is to rely on the FAN_MODIFY async event post write.
> But there is ambiguity in the existing FAN_MODIFY events:
>
>     Thread A starts write on file F (no listener for FAN_PRE_MODIFY)
> Event consumer starts
>         Thread B starts write on file F
>         FAN_PRE_MODIFY(F) reported from thread B
>     Thread A completes write on file F
>     FAN_MODIFY(F) reported from thread A (or from aio completion thread)
> Event consumer believes it got the last event and can read the final
> version of F
>
> So if we use this method, we will need a unique cookie to
> associate the POST_MODIFY with the PRE_MODIFY event.
>
> Something like this:
>
> writer                                [fsnotifyd]
> -------                                -------------
> file_start_write_usn() => FAN_PRE_MODIFY[ fsid, usn, fhandle ]
> {                                 <= Record change intent before response
> …do some in-core changes
>    (e.g. data + mode + ctime)...
> } file_end_write_usn() => FAN_POST_MODIFY[ fsid, usn, fhandle ]
>                                          Consume changes after FAN_POST_MODIFY
>
> While this is a viable option, it adds yet more hooks and more
> events and it does not provide an easy way for consumers to
> wait for the completion of a batch of modifications.
>
> The vfs_write_barrier method provides a better way to wait for completion:
>
> writer                                [fsnotifyd]
> -------                                -------------
> file_start_write_srcu() => FAN_PRE_MODIFY[ fsid, usn, fhandle ]
> {                                  <= Record change intent before response
> …do some in-core changes under srcu read lock
>    (e.g. data + mode + ctime)...
> } file_end_write_srcu()
>      synchronize_srcu()   <= vfs_write_barrier();
>                     Consume a batch of recorded changes after write barrier
>                     act on the changes and clear the change intent records
>
> I am hoping to be able to argue for the case of vfs_write_barrier()
> in LSFMM, but if this will not be acceptable, I can work with the
> post modify events solution.
>

FYI, I discussed this with some folks at LSFMM after my talk, and what
was apparent to me from that chat, and also from the questions during my
presentation, is that I did not succeed in explaining the problem.

I believe that the path forward for me, which is something that Jan
has told me from the beginning, is to implement a reference design
of a persistent change journal, because this API is too complex to
discuss without the user code that uses it.

I am still on the fence about whether I want to do a userspace fsnotifyfd
or a kernel persistent change journal library/subsystem as the reference
design. I already have a kernel subsystem (ovl watch), so I may end
up cleaning that one up to use a proper fanotify API; maybe that would
be the way to do it.

One more thing that I realised during LSFMM is that some filesystems
(e.g. NTFS, Lustre) already have an internal persistent change journal.
If I implement a kernel persistent change journal subsystem, then
we could use the same fanotify API to read events both from a fs that
implements its own persistent change journal and from a fs that uses
the fs-agnostic persistent change journal.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-03-27 18:23 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-01-17 18:01 [LSF/MM/BPF TOPIC] vfs write barriers Amir Goldstein
2025-01-19 21:15 ` Dave Chinner
2025-01-20 11:41   ` Amir Goldstein
2025-01-23  0:34     ` Dave Chinner
2025-01-23 14:01       ` Amir Goldstein
2025-01-23 18:14     ` Jeff Layton
2025-01-24 21:07       ` Amir Goldstein
2025-02-11 14:53       ` Jan Kara
2025-03-20 17:00         ` Amir Goldstein
2025-03-27 18:23           ` Amir Goldstein
2025-01-27 23:34     ` Dave Chinner
2025-01-29  1:39       ` Amir Goldstein
2025-02-11 21:12         ` Dave Chinner
2025-02-12  8:29           ` Amir Goldstein

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox