* [LSF/MM/BPF TOPIC] vfs write barriers
@ 2025-01-17 18:01 Amir Goldstein
2025-01-19 21:15 ` Dave Chinner
0 siblings, 1 reply; 14+ messages in thread
From: Amir Goldstein @ 2025-01-17 18:01 UTC (permalink / raw)
To: linux-fsdevel
Cc: lsf-pc, Jan Kara, Christian Brauner, Josef Bacik, Jeff Layton
Hi all,
I would like to present the idea of vfs write barriers that was proposed by Jan
and prototyped for the use of fanotify HSM change tracking events [1].
The historical records state that I had mentioned the idea briefly at the end of
my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
its wider implications at the time.
The vfs write barriers are implemented by taking a per-sb srcu read side
lock for the scope of {mnt,file}_{want,drop}_write().
This could be used - in the case of the prototype, by an HSM service -
to wait for all in-flight write syscalls without blocking new write
syscalls, unlike the stricter fsfreeze(), which blocks them.
This ability to wait for in-flight write syscalls is used by the prototype to
implement a crash consistent change tracking method [3] without the
need to use the heavy fsfreeze() hammer.
For the prototype, there is no user API to enable write barriers
or to wait for in-flight write syscalls; the only user is an internal
one (fanotify), so the only user API is the fanotify API. However, the
vfs infrastructure was written in a way that could serve other
subsystems or be exposed to user applications via a vfs API.
I wanted to throw these questions to the crowd:
- Can you think of other internal use cases for an SRCU scope for
vfs write operations [*]? For other vfs operations?
- Would it be useful to export this API to userspace so applications
could make use of it?
[*] "vfs write operations" in this context refers to any operation
that would block on a frozen fs.
I recall that Jeff mentioned there could be some use case related
to providing crash consistency to NFS change cookies, but I am
not sure if this is still relevant after the multigrain ctime work.
Thanks,
Amir.
[1] https://github.com/amir73il/linux/commits/sb_write_barrier
[2] https://lwn.net/Articles/932415/
[3] https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#modified-files-query
^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-17 18:01 [LSF/MM/BPF TOPIC] vfs write barriers Amir Goldstein
@ 2025-01-19 21:15 ` Dave Chinner
  2025-01-20 11:41   ` Amir Goldstein
  0 siblings, 1 reply; 14+ messages in thread

From: Dave Chinner @ 2025-01-19 21:15 UTC (permalink / raw)
To: Amir Goldstein
Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik,
    Jeff Layton

On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> Hi all,
>
> I would like to present the idea of vfs write barriers that was proposed by Jan
> and prototyped for the use of fanotify HSM change tracking events [1].
>
> The historical records state that I had mentioned the idea briefly at the end of
> my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> its wider implications at the time.
>
> The vfs write barriers are implemented by taking a per-sb srcu read side
> lock for the scope of {mnt,file}_{want,drop}_write().
>
> This could be used by users - in the case of the prototype - an HSM service -
> to wait for all in-flight write syscalls, without blocking new write syscalls
> as the stricter fsfreeze() does.
>
> This ability to wait for in-flight write syscalls is used by the prototype to
> implement a crash consistent change tracking method [3] without the
> need to use the heavy fsfreeze() hammer.

How does this provide any guarantee at all? It doesn't order or
wait for physical IOs in any way, so writeback can be active on a
file and writing data from both sides of a syscall write "barrier".
i.e. there is no coherency between what is on disk, the cmtime of
the inode and the write barrier itself.

Freeze is an actual physical write barrier. A very heavy handed
physical write barrier, yes, but it has very well defined and
bounded physical data persistence semantics.
This proposed write barrier does not seem capable of providing any
sort of physical data or metadata/data write ordering guarantees, so
I'm a bit lost in how it can be used to provide reliable "crash
consistent change tracking" when there is no relationship between
the data/metadata in memory and data/metadata on disk...

-Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-19 21:15 ` Dave Chinner
@ 2025-01-20 11:41   ` Amir Goldstein
  2025-01-23  0:34     ` Dave Chinner
                        ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread

From: Amir Goldstein @ 2025-01-20 11:41 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik,
    Jeff Layton

On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > Hi all,
> >
> > I would like to present the idea of vfs write barriers that was proposed by Jan
> > and prototyped for the use of fanotify HSM change tracking events [1].
> >
> > The historical records state that I had mentioned the idea briefly at the end of
> > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> > its wider implications at the time.
> >
> > The vfs write barriers are implemented by taking a per-sb srcu read side
> > lock for the scope of {mnt,file}_{want,drop}_write().
> >
> > This could be used by users - in the case of the prototype - an HSM service -
> > to wait for all in-flight write syscalls, without blocking new write syscalls
> > as the stricter fsfreeze() does.
> >
> > This ability to wait for in-flight write syscalls is used by the prototype to
> > implement a crash consistent change tracking method [3] without the
> > need to use the heavy fsfreeze() hammer.
>
> How does this provide any guarantee at all? It doesn't order or
> wait for physical IOs in any way, so writeback can be active on a
> file and writing data from both sides of a syscall write "barrier".
> i.e. there is no coherency between what is on disk, the cmtime of
> the inode and the write barrier itself.
>
> Freeze is an actual physical write barrier. A very heavy handed
> physical write barrier, yes, but it has very well defined and
> bounded physical data persistence semantics.

Yes.
Freeze is a "write barrier to persistent storage".
This is not what "vfs write barrier" is about.
I will try to explain better.

Some syscalls modify the data/metadata of filesystem objects in memory
(a.k.a "in-core") and some syscalls query in-core data/metadata
of filesystem objects.

It is often the case that in-core data/metadata readers are not fully
synchronized with in-core data/metadata writers, and in-core data and
metadata are often not modified atomically w.r.t the in-core
data/metadata readers.
Even related metadata attributes are often not modified atomically
w.r.t. their readers (e.g. statx()).

When it comes to "observing changes", multigrain ctime/mtime has
improved things a lot for observing a change in ctime/mtime since
last sampled and for observing an order of ctime/mtime changes
on different inodes, but it hasn't changed the fact that ctime/mtime
changes can be observed *before* the respective metadata/data
changes can be observed.

An example problem is that a naive backup or indexing program can
read old data/metadata with new timestamp T and wrongly conclude
that it has read all changes up to time T.

It is true that "real" backup programs know that applications and
filesystems need to be quiesced before backup, but everyday
cloud storage sync programs and indexers cannot practically
freeze the filesystem for their work.

For the HSM prototype, we track changes to a filesystem during
a given time period by handling pre-modify vfs events and recording
the file handles of changed objects.

sb_write_barrier(sb) provides an (internal so far) vfs API to wait
for in-flight syscalls that may still be modifying user visible in-core
data/metadata, without blocking new syscalls.

The method described in the HSM prototype [3] uses this API
to persist the state that all the changes until time T were "observed".
> This proposed write barrier does not seem capable of providing any
> sort of physical data or metadata/data write ordering guarantees, so
> I'm a bit lost in how it can be used to provide reliable "crash
> consistent change tracking" when there is no relationship between
> the data/metadata in memory and data/metadata on disk...

That's a good question. A bit hard to explain but I will try.

The short answer is that the vfs write barrier does *not* by itself
provide the guarantee for "crash consistent change tracking".

In the prototype, the "crash consistent change tracking" guarantee
is provided by the fact that the change records are recorded as
metadata in the same filesystem, prior to the modification, and
those metadata records are strictly ordered by the filesystem before
the actual change.

The vfs write barrier makes it possible to partition the change
tracking records into overlapping time periods in a way that allows
the *consumer* of the changes to consume the changes in a "crash
consistent manner", because:

1. All the in-core changes recorded before the barrier are fully
   observable after the barrier
2. All the in-core changes that started after the barrier will be
   recorded for the future change query

I would love to discuss the merits and pitfalls of this method, but the
main thing I wanted to get feedback on is whether anyone finds the
described vfs API useful for anything other than the change tracking
system that I described.

Thanks,
Amir.
* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-20 11:41 ` Amir Goldstein
@ 2025-01-23  0:34   ` Dave Chinner
  2025-01-23 14:01     ` Amir Goldstein
  2025-01-23 18:14   ` Jeff Layton
  2025-01-27 23:34   ` Dave Chinner
  2 siblings, 1 reply; 14+ messages in thread

From: Dave Chinner @ 2025-01-23 0:34 UTC (permalink / raw)
To: Amir Goldstein
Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik,
    Jeff Layton

On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> For the HSM prototype, we track changes to a filesystem during
> a given time period by handling pre-modify vfs events and recording
> the file handles of changed objects.
>
> sb_write_barrier(sb) provides an (internal so far) vfs API to wait
> for in-flight syscalls that can be still modifying user visible in-core
> data/metadata, without blocking new syscalls.

Yes, I get this part. What I don't understand is how it is in any
way useful....

> The method described in the HSM prototype [3] uses this API
> to persist the state that all the changes until time T were "observed".
>
> > This proposed write barrier does not seem capable of providing any
> > sort of physical data or metadata/data write ordering guarantees, so
> > I'm a bit lost in how it can be used to provide reliable "crash
> > consistent change tracking" when there is no relationship between
> > the data/metadata in memory and data/metadata on disk...
>
> That's a good question. A bit hard to explain but I will try.
>
> The short answer is that the vfs write barrier does *not* by itself
> provide the guarantee for "crash consistent change tracking".
>
> In the prototype, the "crash consistent change tracking" guarantee
> is provided by the fact that the change records are recorded as
> metadata in the same filesystem, prior to the modification and
> those metadata records are strictly ordered by the filesystem before
> the actual change.
This doesn't make any sense to me - you seem to be making
assumptions that I know an awful lot about how your HSM prototype
works.

What's in a change record, when does it get written, what is its
persistence semantics, what filesystem metadata is it being written
to? How does this relate to the actual dirty data that is resident
in the page cache that hasn't been written to stable storage yet?
Is there another change record to say the data the first change
record tracks has been written to persistent storage?

> The vfs write barrier allows to partition the change tracking records
> into overlapping time periods in a way that allows the *consumer* of
> the changes to consume the changes in a "crash consistent manner",
> because:
> 1. All the in-core changes recorded before the barrier are fully
> observable after the barrier
> 2. All the in-core changes that started after the barrier, will be recorded
> for the future change query
>
> I would love to discuss the merits and pitfalls of this method, but the
> main thing I wanted to get feedback on is whether anyone finds the
> described vfs API useful for anything other than the change tracking
> system that I described.

This seems like a very specialised niche use case right now, but I
still have no clear idea how the application using this proposed
write barrier actually works to achieve the stated functionality
this feature provides it with...

-Dave.
-- 
Dave Chinner
david@fromorbit.com
* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-23  0:34 ` Dave Chinner
@ 2025-01-23 14:01   ` Amir Goldstein
  0 siblings, 0 replies; 14+ messages in thread

From: Amir Goldstein @ 2025-01-23 14:01 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik,
    Jeff Layton

On Thu, Jan 23, 2025 at 1:34 AM Dave Chinner <david@fromorbit.com> wrote:
>
> On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> > For the HSM prototype, we track changes to a filesystem during
> > a given time period by handling pre-modify vfs events and recording
> > the file handles of changed objects.
> >
> > sb_write_barrier(sb) provides an (internal so far) vfs API to wait
> > for in-flight syscalls that can be still modifying user visible in-core
> > data/metadata, without blocking new syscalls.
>
> Yes, I get this part. What I don't understand is how it is in any
> way useful....
>
> > The method described in the HSM prototype [3] uses this API
> > to persist the state that all the changes until time T were "observed".
> >
> > > This proposed write barrier does not seem capable of providing any
> > > sort of physical data or metadata/data write ordering guarantees, so
> > > I'm a bit lost in how it can be used to provide reliable "crash
> > > consistent change tracking" when there is no relationship between
> > > the data/metadata in memory and data/metadata on disk...
> >
> > That's a good question. A bit hard to explain but I will try.
> >
> > The short answer is that the vfs write barrier does *not* by itself
> > provide the guarantee for "crash consistent change tracking".
> >
> > In the prototype, the "crash consistent change tracking" guarantee
> > is provided by the fact that the change records are recorded as
> > metadata in the same filesystem, prior to the modification and
> > those metadata records are strictly ordered by the filesystem before
> > the actual change.
>
> This doesn't make any sense to me - you seem to be making
> assumptions that I know an awful lot about how your HSM prototype
> works.
>
> What's in a change record

The prototype creates a directory entry of this name:

changed_dirs/$T/<directory file handle hex>

which gets created, if it does not already exist, before a change in
a directory or before a change to a file's data/metadata [*].

[*] For non-dir, the change record is for ANY parent of the file;
    if the file is unlinked, we have no need to track changes;
    if the file is disconnected, it's up to the HSM to decide whether
    to block the change or not record it

> when does it get written,

from handling of fanotify pre-modify events (not upstream yet),
*before* the change to in-core data/metadata; the events are hooked
inside {file,mnt}_want_write() wrappers, *before* {file,sb}_start_write().

> what is its persistence semantics

The consumer (HSM service) is responsible for persisting change
records (e.g. by fsync of changed_dirs/$T/).
The only guarantee it expects from the filesystem is that the change
records (directory entries) are strictly ordered to storage before
data/metadata changes that are executed after writing the change record.

> what filesystem metadata is it being written to?

For the prototype it is a directory index, but that is an
implementation detail of this prototype.

> how does this relate to the actual dirty data that is
> resident in the page cache that hasn't been written to stable
> storage yet?

The relation is as follows:
- HSM starts recording change records under both changed_dirs/$T/
  and changed_dirs/$((T+1))/
- HSM calls sb_write_barrier() and syncfs()
- Then HSM stops recording changes in changed_dirs/$T/

So by the time changed_dirs/$T/ is "sealed", all the dirty data will
be either persistent in stable storage OR also recorded in
changed_dirs/$((T+1))/.

> Is there another change record to say the data the
> first change record tracks has been written to persistent storage?
>

Yes, I use a symlink to denote the "current" live change tracking
session, something like:

$ ln -sf $((T)) changed_dirs/current
...
$ ln -sf $((T+1)) changed_dirs/next
... (write barrier etc)
$ sync -f changed_dirs  # seal current
$ mv changed_dirs/next changed_dirs/current

As you can see, I was trying to avoid tying the persistence semantics
to the kernel implementation of HSM.
As far as I can tell, the only thing I am missing from the kernel is
the vfs write barrier in order to take care of the rest in userspace.

Yes, there is this baby elephant in the room that "strictly ordered
metadata" is not in any contract, but I am willing to live with that
for now, for the benefit of a filesystem agnostic HSM implementation.

> > The vfs write barrier allows to partition the change tracking records
> > into overlapping time periods in a way that allows the *consumer* of
> > the changes to consume the changes in a "crash consistent manner",
> > because:
> >
> > 1. All the in-core changes recorded before the barrier are fully
> > observable after the barrier
> > 2. All the in-core changes that started after the barrier, will be recorded
> > for the future change query
> >
> > I would love to discuss the merits and pitfalls of this method, but the
> > main thing I wanted to get feedback on is whether anyone finds the
> > described vfs API useful for anything other than the change tracking
> > system that I described.
>
> This seems like a very specialised niche use case right now, but I
> still have no clear idea how the application using this proposed
> write barrier actually works to achieve the stated functionality
> this feature provides it with...

The problem that the vfs write barrier is trying to solve is the
problem of order between changing and observing in-core data/metadata.
It seems like a problem that is more generic than my specialized
niche, but maybe it isn't.
The consumer of change tracking will start observing (reading) the
data/metadata only after sealing the period $T records, so it avoids
the risk of observing old data/metadata in a directory recorded in
period $T without having another record in period $T+1.

The point in all this story is that the vfs write barrier is needed
even if there is no syncfs() at all and the application does not care
about persistence at all.
For example, for an application that syncs files to a replica storage,
without the write barrier, the change query for period T can result in
reading non-updated data/metadata and reaching the incorrect conclusion
that *everything is in sync*.

Thanks,
Amir.
* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-20 11:41 ` Amir Goldstein
  2025-01-23  0:34   ` Dave Chinner
@ 2025-01-23 18:14   ` Jeff Layton
  2025-01-24 21:07     ` Amir Goldstein
  2025-02-11 14:53     ` Jan Kara
  2025-01-27 23:34   ` Dave Chinner
  2 siblings, 2 replies; 14+ messages in thread

From: Jeff Layton @ 2025-01-23 18:14 UTC (permalink / raw)
To: Amir Goldstein, Dave Chinner
Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik

On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote:
> On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > > Hi all,
> > >
> > > I would like to present the idea of vfs write barriers that was proposed by Jan
> > > and prototyped for the use of fanotify HSM change tracking events [1].
> > >
> > > The historical records state that I had mentioned the idea briefly at the end of
> > > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> > > its wider implications at the time.
> > >
> > > The vfs write barriers are implemented by taking a per-sb srcu read side
> > > lock for the scope of {mnt,file}_{want,drop}_write().
> > >
> > > This could be used by users - in the case of the prototype - an HSM service -
> > > to wait for all in-flight write syscalls, without blocking new write syscalls
> > > as the stricter fsfreeze() does.
> > >
> > > This ability to wait for in-flight write syscalls is used by the prototype to
> > > implement a crash consistent change tracking method [3] without the
> > > need to use the heavy fsfreeze() hammer.
> >
> > How does this provide any guarantee at all? It doesn't order or
> > wait for physical IOs in any way, so writeback can be active on a
> > file and writing data from both sides of a syscall write "barrier".
> > i.e. there is no coherency between what is on disk, the cmtime of
> > the inode and the write barrier itself.
> >
> > Freeze is an actual physical write barrier. A very heavy handed
> > physical write barrier, yes, but it has very well defined and
> > bounded physical data persistence semantics.
>
> Yes. Freeze is a "write barrier to persistent storage".
> This is not what "vfs write barrier" is about.
> I will try to explain better.
>
> Some syscalls modify the data/metadata of filesystem objects in memory
> (a.k.a "in-core") and some syscalls query in-core data/metadata
> of filesystem objects.
>
> It is often the case that in-core data/metadata readers are not fully
> synchronized with in-core data/metadata writers and it is often that
> in-core data and metadata are not modified atomically w.r.t the
> in-core data/metadata readers.
> Even related metadata attributes are often not modified atomically
> w.r.t. their readers (e.g. statx()).
>
> When it comes to "observing changes" multigrain ctime/mtime has
> improved things a lot for observing a change in ctime/mtime since
> last sampled and for observing an order of ctime/mtime changes
> on different inodes, but it hasn't changed the fact that ctime/mtime
> changes can be observed *before* the respective metadata/data
> changes can be observed.
>
> An example problem is that a naive backup or indexing program can
> read old data/metadata with new timestamp T and wrongly conclude
> that it read all changes up to time T.
>
> It is true that "real" backup programs know that applications and
> filesystems need to be quiesced before backup, but actual
> day to day cloud storage sync programs and indexers cannot
> practically freeze the filesystem for their work.
>

Right. That is still a known problem. For directory operations, the
i_rwsem keeps things consistent, but for regular files, it's possible
to see new timestamps alongside with old file contents.
That's a problem since caching algorithms that watch for timestamp
changes can end up not seeing the new contents until the _next_ change
occurs, which might not ever happen.

It would be better to change the file write code to update the
timestamps after copying data to the pagecache. It would still be
possible in that case to see old attributes + new contents, but that's
preferable to the reverse for callers that are watching for changes to
attributes.

Would fixing that help your use-case at all?

> For the HSM prototype, we track changes to a filesystem during
> a given time period by handling pre-modify vfs events and recording
> the file handles of changed objects.
>
> sb_write_barrier(sb) provides an (internal so far) vfs API to wait
> for in-flight syscalls that can be still modifying user visible in-core
> data/metadata, without blocking new syscalls.
>
> The method described in the HSM prototype [3] uses this API
> to persist the state that all the changes until time T were "observed".
>
> > This proposed write barrier does not seem capable of providing any
> > sort of physical data or metadata/data write ordering guarantees, so
> > I'm a bit lost in how it can be used to provide reliable "crash
> > consistent change tracking" when there is no relationship between
> > the data/metadata in memory and data/metadata on disk...
>
> That's a good question. A bit hard to explain but I will try.
>
> The short answer is that the vfs write barrier does *not* by itself
> provide the guarantee for "crash consistent change tracking".
>
> In the prototype, the "crash consistent change tracking" guarantee
> is provided by the fact that the change records are recorded as
> metadata in the same filesystem, prior to the modification and
> those metadata records are strictly ordered by the filesystem before
> the actual change.
>
> The vfs write barrier allows to partition the change tracking records
> into overlapping time periods in a way that allows the *consumer* of
> the changes to consume the changes in a "crash consistent manner",
> because:
>
> 1. All the in-core changes recorded before the barrier are fully
> observable after the barrier
> 2. All the in-core changes that started after the barrier, will be recorded
> for the future change query
>
> I would love to discuss the merits and pitfalls of this method, but the
> main thing I wanted to get feedback on is whether anyone finds the
> described vfs API useful for anything other than the change tracking
> system that I described.

-- 
Jeff Layton <jlayton@kernel.org>
* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-23 18:14 ` Jeff Layton
@ 2025-01-24 21:07   ` Amir Goldstein
  2025-02-11 14:53   ` Jan Kara
  1 sibling, 0 replies; 14+ messages in thread

From: Amir Goldstein @ 2025-01-24 21:07 UTC (permalink / raw)
To: Jeff Layton
Cc: Dave Chinner, linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner,
    Josef Bacik

On Thu, Jan 23, 2025 at 7:14 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote:
> > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > > > Hi all,
> > > >
> > > > I would like to present the idea of vfs write barriers that was proposed by Jan
> > > > and prototyped for the use of fanotify HSM change tracking events [1].
> > > >
> > > > The historical records state that I had mentioned the idea briefly at the end of
> > > > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> > > > its wider implications at the time.
> > > >
> > > > The vfs write barriers are implemented by taking a per-sb srcu read side
> > > > lock for the scope of {mnt,file}_{want,drop}_write().
> > > >
> > > > This could be used by users - in the case of the prototype - an HSM service -
> > > > to wait for all in-flight write syscalls, without blocking new write syscalls
> > > > as the stricter fsfreeze() does.
> > > >
> > > > This ability to wait for in-flight write syscalls is used by the prototype to
> > > > implement a crash consistent change tracking method [3] without the
> > > > need to use the heavy fsfreeze() hammer.
> > >
> > > How does this provide any guarantee at all? It doesn't order or
> > > wait for physical IOs in any way, so writeback can be active on a
> > > file and writing data from both sides of a syscall write "barrier".
> > > i.e. there is no coherency between what is on disk, the cmtime of
> > > the inode and the write barrier itself.
> > >
> > > Freeze is an actual physical write barrier. A very heavy handed
> > > physical write barrier, yes, but it has very well defined and
> > > bounded physical data persistence semantics.
> >
> > Yes. Freeze is a "write barrier to persistent storage".
> > This is not what "vfs write barrier" is about.
> > I will try to explain better.
> >
> > Some syscalls modify the data/metadata of filesystem objects in memory
> > (a.k.a "in-core") and some syscalls query in-core data/metadata
> > of filesystem objects.
> >
> > It is often the case that in-core data/metadata readers are not fully
> > synchronized with in-core data/metadata writers and it is often that
> > in-core data and metadata are not modified atomically w.r.t the
> > in-core data/metadata readers.
> > Even related metadata attributes are often not modified atomically
> > w.r.t. their readers (e.g. statx()).
> >
> > When it comes to "observing changes" multigrain ctime/mtime has
> > improved things a lot for observing a change in ctime/mtime since
> > last sampled and for observing an order of ctime/mtime changes
> > on different inodes, but it hasn't changed the fact that ctime/mtime
> > changes can be observed *before* the respective metadata/data
> > changes can be observed.
> >
> > An example problem is that a naive backup or indexing program can
> > read old data/metadata with new timestamp T and wrongly conclude
> > that it read all changes up to time T.
> >
> > It is true that "real" backup programs know that applications and
> > filesystems need to be quiesced before backup, but actual
> > day to day cloud storage sync programs and indexers cannot
> > practically freeze the filesystem for their work.
> >
>
> Right. That is still a known problem. For directory operations, the
> i_rwsem keeps things consistent, but for regular files, it's possible
> to see new timestamps alongside with old file contents.
That's a > problem since caching algorithms that watch for timestamp changes can > end up not seeing the new contents until the _next_ change occurs, > which might not ever happen. > > It would be better to change the file write code to update the > timestamps after copying data to the pagecache. It would still be > possible in that case to see old attributes + new contents, but that's > preferable to the reverse for callers that are watching for changes to > attributes. > Yes, I remember this was discussed. I think it may make sense to update before and after copying data to page cache? > Would fixing that help your use-case at all? > I don't think it would, because my use case is not about querying the change status of a single inode. It post change timestamp update helps I don't see how. Thanks, Amir. ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-23 18:14 ` Jeff Layton
  2025-01-24 21:07   ` Amir Goldstein
@ 2025-02-11 14:53   ` Jan Kara
  2025-03-20 17:00     ` Amir Goldstein
  1 sibling, 1 reply; 14+ messages in thread

From: Jan Kara @ 2025-02-11 14:53 UTC (permalink / raw)
To: Jeff Layton
Cc: Amir Goldstein, Dave Chinner, linux-fsdevel, lsf-pc, Jan Kara,
    Christian Brauner, Josef Bacik

On Thu 23-01-25 13:14:11, Jeff Layton wrote:
> On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote:
> > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote:
> > > > Hi all,
> > > >
> > > > I would like to present the idea of vfs write barriers that was proposed by Jan
> > > > and prototyped for the use of fanotify HSM change tracking events [1].
> > > >
> > > > The historical records state that I had mentioned the idea briefly at the end of
> > > > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss
> > > > its wider implications at the time.
> > > >
> > > > The vfs write barriers are implemented by taking a per-sb srcu read side
> > > > lock for the scope of {mnt,file}_{want,drop}_write().
> > > >
> > > > This could be used by users - in the case of the prototype - an HSM service -
> > > > to wait for all in-flight write syscalls, without blocking new write syscalls
> > > > as the stricter fsfreeze() does.
> > > >
> > > > This ability to wait for in-flight write syscalls is used by the prototype to
> > > > implement a crash consistent change tracking method [3] without the
> > > > need to use the heavy fsfreeze() hammer.
> > >
> > > How does this provide any guarantee at all? It doesn't order or
> > > wait for physical IOs in any way, so writeback can be active on a
> > > file and writing data from both sides of a syscall write "barrier".
> > > i.e. there is no coherency between what is on disk, the cmtime of
> > > the inode and the write barrier itself.
> > >
> > > Freeze is an actual physical write barrier. A very heavy handed
> > > physical write barrier, yes, but it has very well defined and
> > > bounded physical data persistence semantics.
> >
> > Yes. Freeze is a "write barrier to persistent storage".
> > This is not what "vfs write barrier" is about.
> > I will try to explain better.
> >
> > Some syscalls modify the data/metadata of filesystem objects in memory
> > (a.k.a. "in-core") and some syscalls query in-core data/metadata
> > of filesystem objects.
> >
> > It is often the case that in-core data/metadata readers are not fully
> > synchronized with in-core data/metadata writers, and it is often the
> > case that in-core data and metadata are not modified atomically w.r.t.
> > the in-core data/metadata readers.
> > Even related metadata attributes are often not modified atomically
> > w.r.t. their readers (e.g. statx()).
> >
> > When it comes to "observing changes", multigrain ctime/mtime has
> > improved things a lot for observing a change in ctime/mtime since
> > last sampled and for observing an order of ctime/mtime changes
> > on different inodes, but it hasn't changed the fact that ctime/mtime
> > changes can be observed *before* the respective metadata/data
> > changes can be observed.
> >
> > An example problem is that a naive backup or indexing program can
> > read old data/metadata with new timestamp T and wrongly conclude
> > that it read all changes up to time T.
> >
> > It is true that "real" backup programs know that applications and
> > filesystems need to be quiesced before backup, but actual
> > day-to-day cloud storage sync programs and indexers cannot
> > practically freeze the filesystem for their work.
> >
>
> Right. That is still a known problem. For directory operations, the
> i_rwsem keeps things consistent, but for regular files, it's possible
> to see new timestamps alongside old file contents.
> That's a problem since caching algorithms that watch for timestamp
> changes can end up not seeing the new contents until the _next_ change
> occurs, which might not ever happen.
>
> It would be better to change the file write code to update the
> timestamps after copying data to the pagecache. It would still be
> possible in that case to see old attributes + new contents, but that's
> preferable to the reverse for callers that are watching for changes to
> attributes.
>
> Would fixing that help your use-case at all?

I think Amir wanted to make a point here in the other direction: i.e., if
the application did:

* sample inode timestamp
* vfs_write_barrier()
* read file data

then it is *guaranteed* it will never see old data & a new timestamp, and
hence the caching problem is solved. No need to update the timestamp after
the write.

Now I agree that updating timestamps after the write is much nicer from a
usability POV (given how common the pattern above is), but this is just a
simple example demonstrating possible uses for vfs_write_barrier().

Honza
--
Jan Kara <jack@suse.com>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 14+ messages in thread
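The sample/barrier/read pattern Jan describes can be illustrated with a toy userspace analogue of the proposed per-sb SRCU read side lock. This is only a sketch built on pthreads, not the kernel SRCU implementation, and the srcu_lite_* names are made up for illustration: writers bracket their in-core updates with a non-blocking read-side lock, and the barrier flips the epoch and waits only for writers that entered before it, without blocking new ones.

```c
#include <assert.h>
#include <pthread.h>

/* Toy stand-in for per-sb SRCU: two epochs of in-flight writer counts. */
struct srcu_lite {
	pthread_mutex_t lock;
	pthread_cond_t  cv;
	int  idx;          /* epoch that new writers enter */
	long nreaders[2];  /* in-flight writers per epoch */
};

#define SRCU_LITE_INIT \
	{ PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, 0, { 0, 0 } }

/* Analogue of entering the write-side scope: never blocks. */
static int srcu_lite_read_lock(struct srcu_lite *s)
{
	pthread_mutex_lock(&s->lock);
	int i = s->idx;
	s->nreaders[i]++;
	pthread_mutex_unlock(&s->lock);
	return i;
}

/* Analogue of leaving the write-side scope. */
static void srcu_lite_read_unlock(struct srcu_lite *s, int i)
{
	pthread_mutex_lock(&s->lock);
	if (--s->nreaders[i] == 0)
		pthread_cond_broadcast(&s->cv);
	pthread_mutex_unlock(&s->lock);
}

/*
 * Analogue of vfs_write_barrier()/synchronize_srcu(): wait for all
 * writers that entered before the epoch flip; writers entering
 * afterwards land in the other epoch and are not waited for.
 */
static void srcu_lite_synchronize(struct srcu_lite *s)
{
	pthread_mutex_lock(&s->lock);
	int old = s->idx;
	s->idx = !old;
	while (s->nreaders[old] != 0)
		pthread_cond_wait(&s->cv, &s->lock);
	pthread_mutex_unlock(&s->lock);
}
```

With this structure, a consumer doing "sample timestamp; barrier; read data" knows that any writer which could have produced the sampled timestamp has finished its in-core update before the data is read, which is exactly the ordering guarantee at issue.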
* Re: [LSF/MM/BPF TOPIC] vfs write barriers 2025-02-11 14:53 ` Jan Kara @ 2025-03-20 17:00 ` Amir Goldstein 2025-03-27 18:23 ` Amir Goldstein 0 siblings, 1 reply; 14+ messages in thread From: Amir Goldstein @ 2025-03-20 17:00 UTC (permalink / raw) To: Jan Kara Cc: Jeff Layton, Dave Chinner, linux-fsdevel, lsf-pc, Christian Brauner, Josef Bacik On Tue, Feb 11, 2025 at 5:22 PM Jan Kara <jack@suse.cz> wrote: > > On Thu 23-01-25 13:14:11, Jeff Layton wrote: > > On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote: > > > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote: > > > > > Hi all, > > > > > > > > > > I would like to present the idea of vfs write barriers that was proposed by Jan > > > > > and prototyped for the use of fanotify HSM change tracking events [1]. > > > > > > > > > > The historical records state that I had mentioned the idea briefly at the end of > > > > > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss > > > > > its wider implications at the time. > > > > > > > > > > The vfs write barriers are implemented by taking a per-sb srcu read side > > > > > lock for the scope of {mnt,file}_{want,drop}_write(). > > > > > > > > > > This could be used by users - in the case of the prototype - an HSM service - > > > > > to wait for all in-flight write syscalls, without blocking new write syscalls > > > > > as the stricter fsfreeze() does. > > > > > > > > > > This ability to wait for in-flight write syscalls is used by the prototype to > > > > > implement a crash consistent change tracking method [3] without the > > > > > need to use the heavy fsfreeze() hammer. > > > > > > > > How does this provide anything guarantee at all? It doesn't order or > > > > wait for physical IOs in any way, so writeback can be active on a > > > > file and writing data from both sides of a syscall write "barrier". > > > > i.e. 
there is no coherency between what is on disk, the cmtime of > > > > the inode and the write barrier itself. > > > > > > > > Freeze is an actual physical write barrier. A very heavy handed > > > > physical right barrier, yes, but it has very well defined and > > > > bounded physical data persistence semantics. > > > > > > Yes. Freeze is a "write barrier to persistence storage". > > > This is not what "vfs write barrier" is about. > > > I will try to explain better. > > > > > > Some syscalls modify the data/metadata of filesystem objects in memory > > > (a.k.a "in-core") and some syscalls query in-core data/metadata > > > of filesystem objects. > > > > > > It is often the case that in-core data/metadata readers are not fully > > > synchronized with in-core data/metadata writers and it is often that > > > in-core data and metadata are not modified atomically w.r.t the > > > in-core data/metadata readers. > > > Even related metadata attributes are often not modified atomically > > > w.r.t to their readers (e.g. statx()). > > > > > > When it comes to "observing changes" multigrain ctime/mtime has > > > improved things a lot for observing a change in ctime/mtime since > > > last sampled and for observing an order of ctime/mtime changes > > > on different inodes, but it hasn't changed the fact that ctime/mtime > > > changes can be observed *before* the respective metadata/data > > > changes can be observed. > > > > > > An example problem is that a naive backup or indexing program can > > > read old data/metadata with new timestamp T and wrongly conclude > > > that it read all changes up to time T. > > > > > > It is true that "real" backup programs know that applications and > > > filesystem needs to be quisences before backup, but actual > > > day to day cloud storage sync programs and indexers cannot > > > practically freeze the filesystem for their work. > > > > > > > Right. That is still a known problem. 
For directory operations, the > > i_rwsem keeps things consistent, but for regular files, it's possible > > to see new timestamps alongside with old file contents. That's a > > problem since caching algorithms that watch for timestamp changes can > > end up not seeing the new contents until the _next_ change occurs, > > which might not ever happen. > > > > It would be better to change the file write code to update the > > timestamps after copying data to the pagecache. It would still be > > possible in that case to see old attributes + new contents, but that's > > preferable to the reverse for callers that are watching for changes to > > attributes. > > > > Would fixing that help your use-case at all? > > I think Amir wanted to make here a point in the other direction: I.e., if > the application did: > * sample inode timestamp > * vfs_write_barrier() > * read file data > > then it is *guaranteed* it will never see old data & new timestamp and hence > the caching problem is solved. No need to update timestamp after the write. > > Now I agree updating timestamps after write is much nicer from usability > POV (given how common pattern above it) but this is just a simple example > demonstrating possible uses for vfs_write_barrier(). > I was trying to figure out if updating timestamp after write would be enough to deal with file writes and I think that it is not enough when adding signalling (events) into the picture. In this case, the consumer is expected to act on changes (e.g. index/backup) soon after they happen. I think this case is different from NFS cache which only cares about cache invalidation on file access(?). In any case, we need a FAN_PRE_MODIFY blocking event to store a persistent change intent record before the write - that is needed to find changes after a crash. 
Now unless we want to start polling ctime (and we do not want that),
we need a signal to wake the consumer after the write to the page cache.

One way is to rely on the FAN_MODIFY async event post write.
But there is ambiguity in the existing FAN_MODIFY events:

Thread A starts write on file F (no listener for FAN_PRE_MODIFY)
Event consumer starts
Thread B starts write on file F
FAN_PRE_MODIFY(F) reported from thread B
Thread A completes write on file F
FAN_MODIFY(F) reported from thread A (or from aio completion thread)
Event consumer believes it got the last event and can read the final
version of F

So if we use this method we will need a unique cookie to
associate the POST_MODIFY with the PRE_MODIFY event.

Something like this:

writer                       [fsnotifyd]
-------                      -------------
file_start_write_usn()   =>  FAN_PRE_MODIFY[ fsid, usn, fhandle ]
{                        <=  Record change intent before response
  ...do some in-core changes
  (e.g. data + mode + ctime)...
} file_end_write_usn()   =>  FAN_POST_MODIFY[ fsid, usn, fhandle ]
                             Consume changes after FAN_POST_MODIFY

While this is a viable option, it adds yet more hooks and more
events, and it does not provide an easy way for consumers to
wait for the completion of a batch of modifications.

The vfs_write_barrier method provides a better way to wait for completion:

writer                       [fsnotifyd]
-------                      -------------
file_start_write_srcu()  =>  FAN_PRE_MODIFY[ fsid, usn, fhandle ]
{                        <=  Record change intent before response
  ...do some in-core changes under srcu read lock
  (e.g. data + mode + ctime)...
} file_end_write_srcu()
synchronize_srcu()       <=  vfs_write_barrier();
                             Consume a batch of recorded changes
                             after the write barrier, act on the changes
                             and clear the change intent records

I am hoping to be able to argue for the case of vfs_write_barrier()
in LSFMM, but if this will not be acceptable, I can work with the
post-modify events solution.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread
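On the consumer side, disambiguating the scenario above with a cookie is plain bookkeeping: a POST event may retire only the intent record carrying the same cookie, and a file version is consumable only once no intents are outstanding. A minimal sketch; note that FAN_PRE_MODIFY/FAN_POST_MODIFY and the usn cookie are the proposal's hypothetical API, not existing fanotify events, and the handler names are made up.

```c
#include <assert.h>

#define MAXPEND 64

/* Change intents recorded on FAN_PRE_MODIFY, retired on FAN_POST_MODIFY. */
struct change_journal {
	unsigned long pending[MAXPEND]; /* outstanding usn cookies */
	int npending;
};

static void on_pre_modify(struct change_journal *j, unsigned long usn)
{
	/* persist the change intent record before replying to the event */
	j->pending[j->npending++] = usn;
}

static void on_post_modify(struct change_journal *j, unsigned long usn)
{
	for (int i = 0; i < j->npending; i++) {
		if (j->pending[i] == usn) {
			/* retire only the matching intent */
			j->pending[i] = j->pending[--j->npending];
			return;
		}
	}
	/* POST without a recorded PRE: the writer started before we listened */
}

/* The final version of a file may be trusted only when nothing is in flight. */
static int can_consume(const struct change_journal *j)
{
	return j->npending == 0;
}
```

Replaying the thread A/B race above against this bookkeeping shows the point of the cookie: thread A's uncorrelated completion event cannot retire thread B's outstanding intent, so the consumer does not wrongly conclude it saw the final version of F.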
* Re: [LSF/MM/BPF TOPIC] vfs write barriers 2025-03-20 17:00 ` Amir Goldstein @ 2025-03-27 18:23 ` Amir Goldstein 0 siblings, 0 replies; 14+ messages in thread From: Amir Goldstein @ 2025-03-27 18:23 UTC (permalink / raw) To: Jan Kara Cc: Jeff Layton, Dave Chinner, linux-fsdevel, lsf-pc, Christian Brauner, Josef Bacik On Thu, Mar 20, 2025 at 6:00 PM Amir Goldstein <amir73il@gmail.com> wrote: > > On Tue, Feb 11, 2025 at 5:22 PM Jan Kara <jack@suse.cz> wrote: > > > > On Thu 23-01-25 13:14:11, Jeff Layton wrote: > > > On Mon, 2025-01-20 at 12:41 +0100, Amir Goldstein wrote: > > > > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote: > > > > > > > > > > On Fri, Jan 17, 2025 at 07:01:50PM +0100, Amir Goldstein wrote: > > > > > > Hi all, > > > > > > > > > > > > I would like to present the idea of vfs write barriers that was proposed by Jan > > > > > > and prototyped for the use of fanotify HSM change tracking events [1]. > > > > > > > > > > > > The historical records state that I had mentioned the idea briefly at the end of > > > > > > my talk in LSFMM 2023 [2], but we did not really have a lot of time to discuss > > > > > > its wider implications at the time. > > > > > > > > > > > > The vfs write barriers are implemented by taking a per-sb srcu read side > > > > > > lock for the scope of {mnt,file}_{want,drop}_write(). > > > > > > > > > > > > This could be used by users - in the case of the prototype - an HSM service - > > > > > > to wait for all in-flight write syscalls, without blocking new write syscalls > > > > > > as the stricter fsfreeze() does. > > > > > > > > > > > > This ability to wait for in-flight write syscalls is used by the prototype to > > > > > > implement a crash consistent change tracking method [3] without the > > > > > > need to use the heavy fsfreeze() hammer. > > > > > > > > > > How does this provide anything guarantee at all? 
It doesn't order or > > > > > wait for physical IOs in any way, so writeback can be active on a > > > > > file and writing data from both sides of a syscall write "barrier". > > > > > i.e. there is no coherency between what is on disk, the cmtime of > > > > > the inode and the write barrier itself. > > > > > > > > > > Freeze is an actual physical write barrier. A very heavy handed > > > > > physical right barrier, yes, but it has very well defined and > > > > > bounded physical data persistence semantics. > > > > > > > > Yes. Freeze is a "write barrier to persistence storage". > > > > This is not what "vfs write barrier" is about. > > > > I will try to explain better. > > > > > > > > Some syscalls modify the data/metadata of filesystem objects in memory > > > > (a.k.a "in-core") and some syscalls query in-core data/metadata > > > > of filesystem objects. > > > > > > > > It is often the case that in-core data/metadata readers are not fully > > > > synchronized with in-core data/metadata writers and it is often that > > > > in-core data and metadata are not modified atomically w.r.t the > > > > in-core data/metadata readers. > > > > Even related metadata attributes are often not modified atomically > > > > w.r.t to their readers (e.g. statx()). > > > > > > > > When it comes to "observing changes" multigrain ctime/mtime has > > > > improved things a lot for observing a change in ctime/mtime since > > > > last sampled and for observing an order of ctime/mtime changes > > > > on different inodes, but it hasn't changed the fact that ctime/mtime > > > > changes can be observed *before* the respective metadata/data > > > > changes can be observed. > > > > > > > > An example problem is that a naive backup or indexing program can > > > > read old data/metadata with new timestamp T and wrongly conclude > > > > that it read all changes up to time T. 
> > > > > > > > It is true that "real" backup programs know that applications and > > > > filesystem needs to be quisences before backup, but actual > > > > day to day cloud storage sync programs and indexers cannot > > > > practically freeze the filesystem for their work. > > > > > > > > > > Right. That is still a known problem. For directory operations, the > > > i_rwsem keeps things consistent, but for regular files, it's possible > > > to see new timestamps alongside with old file contents. That's a > > > problem since caching algorithms that watch for timestamp changes can > > > end up not seeing the new contents until the _next_ change occurs, > > > which might not ever happen. > > > > > > It would be better to change the file write code to update the > > > timestamps after copying data to the pagecache. It would still be > > > possible in that case to see old attributes + new contents, but that's > > > preferable to the reverse for callers that are watching for changes to > > > attributes. > > > > > > Would fixing that help your use-case at all? > > > > I think Amir wanted to make here a point in the other direction: I.e., if > > the application did: > > * sample inode timestamp > > * vfs_write_barrier() > > * read file data > > > > then it is *guaranteed* it will never see old data & new timestamp and hence > > the caching problem is solved. No need to update timestamp after the write. > > > > Now I agree updating timestamps after write is much nicer from usability > > POV (given how common pattern above it) but this is just a simple example > > demonstrating possible uses for vfs_write_barrier(). > > > > I was trying to figure out if updating timestamp after write would be enough > to deal with file writes and I think that it is not enough when adding > signalling > (events) into the picture. > In this case, the consumer is expected to act on changes (e.g. index/backup) > soon after they happen. 
> I think this case is different from NFS cache which only cares about cache > invalidation on file access(?). > > In any case, we need a FAN_PRE_MODIFY blocking event to store a > persistent change intent record before the write - that is needed to find > changes after a crash. > > Now unless we want to start polling ctime (and we do not want that), > we need a signal to wake the consumer after the write to page cache > > One way is to rely on the FAN_MODIFY async event post write. > But there is ambiguity in the existing FAN_MODIFY events: > > Thread A starts write on file F (no listener for FAN_PRE_MODIFY) > Event consumer starts > Thread B starts write on file F > FAN_PRE_MODIFY(F) reported from thread B > Thread A completes write on file F > FAN_MODIFY(F) reported from thread A (or from aio completion thread) > Event consumer believes it got the last event and can read the final > version of F > > So if we use this method we will need a unique cookie to > associate the POST_MODIFY with the PRE_MODIFY event. > > Something like this: > > writer [fsnotifyd] > ------- ------------- > file_start_write_usn() => FAN_PRE_MODIFY[ fsid, usn, fhandle ] > { <= Record change intent before response > …do some in-core changes > (e.g. data + mode + ctime)... > } file_end_write_usn() => FAN_POST_MODIFY[ fsid, usn, fhandle ] > Consume changes after FAN_POST_MODIFY > > While this is a viable option, it adds yet more hooks and more > events and it does not provide an easy way for consumers to > wait for the completion of a batch of modifications. > > The vfs_write_barrier method provides a better way to wait for completion: > > writer [fsnotifyd] > ------- ------------- > file_start_write_srcu() => FAN_PRE_MODIFY[ fsid, usn, fhandle ] > { <= Record change intent before response > …do some in-core changes under srcu read lock > (e.g. data + mode + ctime)... 
> } file_end_write_srcu()
> synchronize_srcu()       <=  vfs_write_barrier();
>                              Consume a batch of recorded changes
>                              after the write barrier, act on the changes
>                              and clear the change intent records
>
> I am hoping to be able to argue for the case of vfs_write_barrier()
> in LSFMM, but if this will not be acceptable, I can work with the
> post-modify events solution.
>

FYI, I discussed it with some folks at LSFMM after my talk, and what was
apparent to me from this chat, and also from the questions during my
presentation, is that I did not succeed in explaining the problem.

I believe that the path forward for me, which is something that Jan has
told me from the beginning, is to implement a reference design of a
persistent change journal, because this is too complex an API to discuss
without the user code that uses it.

I am still on the fence about whether I want to do a userspace fsnotifyfd
or a kernel persistent change journal library/subsystem as a reference
design. I already have a kernel subsystem (ovl watch), so I may end up
cleaning that one up to use a proper fanotify API, and maybe that would
be the way to do it.

One more thing that I realised during LSFMM is that some filesystems
(e.g. NTFS, Lustre) already have an internal persistent change journal.
If I implement a kernel persistent change journal subsystem, then we
could use the same fanotify API to read events from a fs that implements
its own persistent change journal and from a fs that uses the
fs-agnostic persistent change journal.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [LSF/MM/BPF TOPIC] vfs write barriers 2025-01-20 11:41 ` Amir Goldstein 2025-01-23 0:34 ` Dave Chinner 2025-01-23 18:14 ` Jeff Layton @ 2025-01-27 23:34 ` Dave Chinner 2025-01-29 1:39 ` Amir Goldstein 2 siblings, 1 reply; 14+ messages in thread From: Dave Chinner @ 2025-01-27 23:34 UTC (permalink / raw) To: Amir Goldstein Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik, Jeff Layton On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote: > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote: > > This proposed write barrier does not seem capable of providing any > > sort of physical data or metadata/data write ordering guarantees, so > > I'm a bit lost in how it can be used to provide reliable "crash > > consistent change tracking" when there is no relationship between > > the data/metadata in memory and data/metadata on disk... > > That's a good question. A bit hard to explain but I will try. > > The short answer is that the vfs write barrier does *not* by itself > provide the guarantee for "crash consistent change tracking". > > In the prototype, the "crash consistent change tracking" guarantee > is provided by the fact that the change records are recorded as > as metadata in the same filesystem, prior to the modification and > those metadata records are strictly ordered by the filesystem before > the actual change. Uh, ok. I've read the docco and I think I understand what the prototype you've pointed me at is doing. It is using a separate chunk of the filesystem as a database to persist change records for data in the filesystem. It is doing this by creating an empty(?) file per change record in a per time period (T) directory instance. i.e. 
write()
  -> pre-modify
    -> fanotify
      -> userspace HSM
        -> create file in dir T named "<filehandle-other-stuff>"

And then you're relying on the filesystem to make that directory
entry T/<filehandle-other-stuff> stable before the data the
pre-modify record was generated for ever gets written.

IOWs, you're specifically relying on *all unrelated metadata changes
in the filesystem* having strict global ordering *and* being
persisted before any data written after the metadata was created
is persisted.

Sure, this might work right now on XFS because the journalling
implementation -currently- provides global metadata ordering and
data/metadata ordering based on IO completion to submission
ordering.

However, we do not guarantee that XFS will -always- have this
behaviour. This is an *implementation detail*, not a guaranteed
behaviour we will preserve for all time. i.e. we reserve the right
to change how we do unrelated metadata and data/metadata ordering
internally.

This reminds me of how applications observed that ext3 ordered mode
didn't require fsync to guarantee the data got written before the
metadata, so they elided the fsync() because it was really expensive
on ext3. i.e. they started relying on a specific filesystem
implementation detail for "correct crash consistency behaviour",
without understanding that it -only worked on ext3- and broke crash
consistency behaviour on all other filesystems. That was *bad*, and
it took a long time to get the message across that applications
*must* use fsync() for correct crash consistency behaviour...

What you are describing for your prototype HSM to provide crash
consistent change tracking really seems to me like it is reliant
on the side effects of specific filesystem implementation choices,
not a behaviour that all filesystems guarantee.

i.e. not all filesystems provide strict global metadata ordering
semantics, and some fs maintainers are on record explicitly stating
that they will not provide or guarantee them. e.g.
ext4, especially with fast commits enabled, will not provide global strictly ordered metadata semantics. btrfs also doesn't provide such a guarantee, either. > I would love to discuss the merits and pitfalls of this method, but the > main thing I wanted to get feedback on is whether anyone finds the > described vfs API useful for anything other that the change tracking > system that I described. If my understanding is correct, then this HSM prototype change tracking mechanism seems like a fragile, unsupportable architecture. I don't think we should be trying to add new VFS infrastructure to make it work, because I think the underlying behaviours it requires from filesystems are simply not guaranteed to exist. -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: [LSF/MM/BPF TOPIC] vfs write barriers 2025-01-27 23:34 ` Dave Chinner @ 2025-01-29 1:39 ` Amir Goldstein 2025-02-11 21:12 ` Dave Chinner 0 siblings, 1 reply; 14+ messages in thread From: Amir Goldstein @ 2025-01-29 1:39 UTC (permalink / raw) To: Dave Chinner Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik, Jeff Layton On Tue, Jan 28, 2025 at 12:34 AM Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote: > > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote: > > > This proposed write barrier does not seem capable of providing any > > > sort of physical data or metadata/data write ordering guarantees, so > > > I'm a bit lost in how it can be used to provide reliable "crash > > > consistent change tracking" when there is no relationship between > > > the data/metadata in memory and data/metadata on disk... > > > > That's a good question. A bit hard to explain but I will try. > > > > The short answer is that the vfs write barrier does *not* by itself > > provide the guarantee for "crash consistent change tracking". > > > > In the prototype, the "crash consistent change tracking" guarantee > > is provided by the fact that the change records are recorded as > > as metadata in the same filesystem, prior to the modification and > > those metadata records are strictly ordered by the filesystem before > > the actual change. > > Uh, ok. > > I've read the docco and I think I understand what the prototype > you've pointed me at is doing. > > It is using a separate chunk of the filesystem as a database to > persist change records for data in the filesystem. It is doing this > by creating an empty(?) file per change record in a per time > period (T) directory instance. > > i.e. 
> > write() > -> pre-modify > -> fanotify > -> userspace HSM > -> create file in dir T named "<filehandle-other-stuff>" > > And then you're relying on the filesystem to make that directory > entry T/<filehandle-other-stuff> stable before the data the > pre-modify record was generated for ever gets written. > Yes. > IOWs, you've specifically relying on *all unrelated metadata changes > in the filesystem* having strict global ordering *and* being > persisted before any data written after the metadata was created > is persisted. > > Sure, this might work right now on XFS because the journalling > implementation -currently- provides global metadata ordering and > data/metadata ordering based on IO completion to submission > ordering. > Yes. > However, we do not guarantee that XFS will -always- have this > behaviour. This is an *implementation detail*, not a guaranteed > behaviour we will preserve for all time. i.e. we reserve the right > to change how we do unrelated metadata and data/metadata ordering > internally. > Yes, that's why its a prototype, but its a userspace prototype. The requirements from the kernel API won't change if the userspace server would have used an independent nvram to store the change record. > This reminds of how applications observed that ext3 ordered mode > didn't require fsync to guarantee the data got written before the > metadata, so they elided the fsync() because it was really expensive > on ext3. i.e. they started relying on a specific filesystem > implementation detail for "correct crash consistency behaviour", > without understanding that it -only worked on ext3- and broken crash > consistency behaviour on all other filesystems. That was *bad*, and > it took a long time to get the message across that applications > *must* use fsync() for correct crash consistency behaviour... I am familiar with that episode. 
>
> What you are describing for your prototype HSM to provide crash
> consistent change tracking really seems to me like it is reliant
> on the side effects of specific filesystem implementation choices,
> not a behaviour that all filesystems guarantee.
>
> i.e. not all filesystems provide strict global metadata ordering
> semantics, and some fs maintainers are on record explicitly stating
> that they will not provide or guarantee them. e.g. ext4, especially
> with fast commits enabled, will not provide global strictly ordered
> metadata semantics. btrfs also doesn't provide such a guarantee,
> either.
>

Right. We once made a proposal to formalize this contract [1], but
it's a bit off-topic.

> > I would love to discuss the merits and pitfalls of this method, but the
> > main thing I wanted to get feedback on is whether anyone finds the
> > described vfs API useful for anything other than the change tracking
> > system that I described.
>
> If my understanding is correct, then this HSM prototype change
> tracking mechanism seems like a fragile, unsupportable architecture.
> I don't think we should be trying to add new VFS infrastructure to
> make it work, because I think the underlying behaviours it requires
> from filesystems are simply not guaranteed to exist.
>

That's a valid opinion.

Do you have an idea for a better design for fs-agnostic change tracking?
I mean, sure, we could re-implement DMAPI in specific filesystems, but I
don't think anyone would like that.

IMO the metadata ordering contract is a technical matter that could be
fixed.

I still hold the opinion that the order of in-core changes w.r.t. readers
is a problem regardless of persistence to disk, but I may need to come up
with more compelling use cases to demonstrate this problem.

Thanks,
Amir.

[1] https://lore.kernel.org/linux-fsdevel/CAOQ4uxjZm6E2TmCv8JOyQr7f-2VB0uFRy7XEp8HBHQmMdQg+6w@mail.gmail.com/

^ permalink raw reply	[flat|nested] 14+ messages in thread
* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-01-29  1:39 ` Amir Goldstein
@ 2025-02-11 21:12   ` Dave Chinner
  2025-02-12  8:29     ` Amir Goldstein
  0 siblings, 1 reply; 14+ messages in thread
From: Dave Chinner @ 2025-02-11 21:12 UTC (permalink / raw)
To: Amir Goldstein
Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik, Jeff Layton

On Wed, Jan 29, 2025 at 02:39:56AM +0100, Amir Goldstein wrote:
> On Tue, Jan 28, 2025 at 12:34 AM Dave Chinner <david@fromorbit.com> wrote:
> >
> > On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> > > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > This proposed write barrier does not seem capable of providing any
> > > > sort of physical data or metadata/data write ordering guarantees, so
> > > > I'm a bit lost in how it can be used to provide reliable "crash
> > > > consistent change tracking" when there is no relationship between
> > > > the data/metadata in memory and data/metadata on disk...
> > >
> > > That's a good question. A bit hard to explain but I will try.
> > >
> > > The short answer is that the vfs write barrier does *not* by itself
> > > provide the guarantee for "crash consistent change tracking".
> > >
> > > In the prototype, the "crash consistent change tracking" guarantee
> > > is provided by the fact that the change records are recorded as
> > > metadata in the same filesystem, prior to the modification, and
> > > those metadata records are strictly ordered by the filesystem before
> > > the actual change.
> >
> > Uh, ok.
> >
> > I've read the docco and I think I understand what the prototype
> > you've pointed me at is doing.
> >
> > It is using a separate chunk of the filesystem as a database to
> > persist change records for data in the filesystem. It is doing this
> > by creating an empty(?) file per change record in a per time
> > period (T) directory instance.
> >
> > i.e.
> >
> > write()
> >   -> pre-modify
> >     -> fanotify
> >       -> userspace HSM
> >         -> create file in dir T named "<filehandle-other-stuff>"
> >
> > And then you're relying on the filesystem to make that directory
> > entry T/<filehandle-other-stuff> stable before the data the
> > pre-modify record was generated for ever gets written.
> >
>
> Yes.
>
> > IOWs, you're specifically relying on *all unrelated metadata changes
> > in the filesystem* having strict global ordering *and* being
> > persisted before any data written after the metadata was created
> > is persisted.
> >
> > Sure, this might work right now on XFS because the journalling
> > implementation -currently- provides global metadata ordering and
> > data/metadata ordering based on IO completion to submission
> > ordering.
> >
>
> Yes.

[....]

> > > I would love to discuss the merits and pitfalls of this method, but the
> > > main thing I wanted to get feedback on is whether anyone finds the
> > > described vfs API useful for anything other than the change tracking
> > > system that I described.
> >
> > If my understanding is correct, then this HSM prototype change
> > tracking mechanism seems like a fragile, unsupportable architecture.
> > I don't think we should be trying to add new VFS infrastructure to
> > make it work, because I think the underlying behaviours it requires
> > from filesystems are simply not guaranteed to exist.
> >
>
> That's a valid opinion.
>
> Do you have an idea for a better design for fs agnostic change tracking?

Store your HSM metadata in a database on a different storage device
and only signal the pre-modification notification as complete once
the database has completed its update transaction.

> I mean, sure, we can re-implement DMAPI in specific fs, but I don't think
> anyone would like that.

DMAPI pre-modification notifications didn't rely on side effects of
filesystem behaviour for correctness. The HSM had to guarantee that
its recording of events was stable before it allowed the
modification to be done. Lots of DMAPI modification notifications
used pre- and post-event notifications so the HSM could keep track
of modifications that were in flight at any given point in time.

That way the HSM recovery process knew after a crash which files it
needed to go look at to determine if the operation in progress had
completed or not once the system came back up....

> IMO the metadata ordering contract is a technical matter that could be fixed.
>
> I still hold the opinion that the in-core changes order w.r.t. readers
> is a problem regardless of persistence to disk, but I may need to come
> up with more compelling use cases to demonstrate this problem.

IIRC, the XFS DMAPI implementation solved that problem by blocking
read notifications whilst there was a pending modification
notification outstanding. The problem with the Linux DMAPI
implementation of this (one of the show stoppers that prevented
merge) was that it held a rwsem across syscall contexts to provide
this functionality.....

-Dave.
--
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 14+ messages in thread
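The per-period change-record scheme described above can be sketched in user space. This is a hypothetical illustration, not the prototype's actual code: the function name, the period-directory layout, and the file-handle string are all assumptions. It shows the explicit-persistence variant Dave recommends, where the HSM fsyncs the record before acking the pre-modify event, rather than relying on journal-ordering side effects:

```python
import os

def record_change_intent(db_root: str, period: int, filehandle_hex: str) -> None:
    """Durably record 'this file is about to change' before allowing the write.

    An empty marker file named after the file handle is created in a
    per-time-period directory T; fsyncing the directory makes the new
    entry stable, so after a crash a recovery scan of T tells the HSM
    which files may hold unconsumed changes.
    """
    period_dir = os.path.join(db_root, str(period))
    os.makedirs(period_dir, exist_ok=True)
    marker = os.path.join(period_dir, filehandle_hex)
    # One (possibly pre-existing) record per file per period is enough.
    fd = os.open(marker, os.O_CREAT | os.O_WRONLY, 0o600)
    os.close(fd)
    # Persist the directory entry itself before acking the pre-modify
    # event -- the explicit equivalent of the metadata-ordering side
    # effect the prototype relies on the filesystem journal for.
    dfd = os.open(period_dir, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

The prototype skips the fsync and instead depends on the filesystem strictly ordering the directory-entry metadata before the data write, which is exactly the behaviour questioned above.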
* Re: [LSF/MM/BPF TOPIC] vfs write barriers
  2025-02-11 21:12 ` Dave Chinner
@ 2025-02-12  8:29   ` Amir Goldstein
  0 siblings, 0 replies; 14+ messages in thread
From: Amir Goldstein @ 2025-02-12 8:29 UTC (permalink / raw)
To: Dave Chinner
Cc: linux-fsdevel, lsf-pc, Jan Kara, Christian Brauner, Josef Bacik, Jeff Layton

On Tue, Feb 11, 2025 at 10:12 PM Dave Chinner <david@fromorbit.com> wrote:
>
> On Wed, Jan 29, 2025 at 02:39:56AM +0100, Amir Goldstein wrote:
> > On Tue, Jan 28, 2025 at 12:34 AM Dave Chinner <david@fromorbit.com> wrote:
> > >
> > > On Mon, Jan 20, 2025 at 12:41:33PM +0100, Amir Goldstein wrote:
> > > > On Sun, Jan 19, 2025 at 10:15 PM Dave Chinner <david@fromorbit.com> wrote:
> > > > > This proposed write barrier does not seem capable of providing any
> > > > > sort of physical data or metadata/data write ordering guarantees, so
> > > > > I'm a bit lost in how it can be used to provide reliable "crash
> > > > > consistent change tracking" when there is no relationship between
> > > > > the data/metadata in memory and data/metadata on disk...
> > > >
> > > > That's a good question. A bit hard to explain but I will try.
> > > >
> > > > The short answer is that the vfs write barrier does *not* by itself
> > > > provide the guarantee for "crash consistent change tracking".
> > > >
> > > > In the prototype, the "crash consistent change tracking" guarantee
> > > > is provided by the fact that the change records are recorded as
> > > > metadata in the same filesystem, prior to the modification, and
> > > > those metadata records are strictly ordered by the filesystem before
> > > > the actual change.
> > >
> > > Uh, ok.
> > >
> > > I've read the docco and I think I understand what the prototype
> > > you've pointed me at is doing.
> > >
> > > It is using a separate chunk of the filesystem as a database to
> > > persist change records for data in the filesystem. It is doing this
> > > by creating an empty(?) file per change record in a per time
> > > period (T) directory instance.
> > >
> > > i.e.
> > >
> > > write()
> > >   -> pre-modify
> > >     -> fanotify
> > >       -> userspace HSM
> > >         -> create file in dir T named "<filehandle-other-stuff>"
> > >
> > > And then you're relying on the filesystem to make that directory
> > > entry T/<filehandle-other-stuff> stable before the data the
> > > pre-modify record was generated for ever gets written.
> > >
> >
> > Yes.
> >
> > > IOWs, you're specifically relying on *all unrelated metadata changes
> > > in the filesystem* having strict global ordering *and* being
> > > persisted before any data written after the metadata was created
> > > is persisted.
> > >
> > > Sure, this might work right now on XFS because the journalling
> > > implementation -currently- provides global metadata ordering and
> > > data/metadata ordering based on IO completion to submission
> > > ordering.
> > >
> >
> > Yes.
>
> [....]
>
> > > > I would love to discuss the merits and pitfalls of this method, but the
> > > > main thing I wanted to get feedback on is whether anyone finds the
> > > > described vfs API useful for anything other than the change tracking
> > > > system that I described.
> > >
> > > If my understanding is correct, then this HSM prototype change
> > > tracking mechanism seems like a fragile, unsupportable architecture.
> > > I don't think we should be trying to add new VFS infrastructure to
> > > make it work, because I think the underlying behaviours it requires
> > > from filesystems are simply not guaranteed to exist.
> > >
> >
> > That's a valid opinion.
> >
> > Do you have an idea for a better design for fs agnostic change tracking?
>
> Store your HSM metadata in a database on a different storage device
> and only signal the pre-modification notification as complete once
> the database has completed its update transaction.
>

Yes, naturally. This was exactly my point in saying that on-disk
persistence is completely orthogonal to the purpose for which the
sb_write_barrier() API is being proposed.

> > I mean, sure, we can re-implement DMAPI in specific fs, but I don't think
> > anyone would like that.
>
> DMAPI pre-modification notifications didn't rely on side effects of
> filesystem behaviour for correctness.

Neither does fanotify. My HSM prototype is relying on some XFS side
effects. A production HSM using the same fanotify API could store
changes in a db on another fs or on persistent memory.

> The HSM had to guarantee that
> its recording of events was stable before it allowed the
> modification to be done.

No change in methodology here.

> Lots of DMAPI modification notifications
> used pre- and post-event notifications so the HSM could keep track
> of modifications that were in flight at any given point in time.
>

OK, now we are talking about the relevant point.

Persistently "recording" an intent to change on pre- is fine.
"Notifying" the application that the change has been done in pre- is
racy, because the application may wrongly believe that it has already
consumed the notified/recorded change.

Complementing every single pre- event with a matching post- event is
one possible solution, and I think Jan and I discussed it as well.

sb_write_barrier() is a much easier API for an HSM, because an HSM
rarely needs to consume a single change; it is much more likely to
consume a large batch of changes, so the sb_write_barrier() API is a
much more efficient way of getting the same guarantee that "all the
changes recorded with pre- events are observable".

> That way the HSM recovery process knew after a crash which files it
> needed to go look at to determine if the operation in progress had
> completed or not once the system came back up....
>

Yes, exactly what we need and what sb_write_barrier() helps to achieve.

> > IMO the metadata ordering contract is a technical matter that could be fixed.
> >
> > I still hold the opinion that the in-core changes order w.r.t. readers
> > is a problem regardless of persistence to disk, but I may need to come
> > up with more compelling use cases to demonstrate this problem.
>
> IIRC, the XFS DMAPI implementation solved that problem by blocking
> read notifications whilst there was a pending modification
> notification outstanding. The problem with the Linux DMAPI
> implementation of this (one of the show stoppers that prevented
> merge) was that it held a rwsem across syscall contexts to provide
> this functionality.....
>

sb_write_barrier() allows the HSM to achieve the same end result
without holding a rwsem across syscall contexts. It's literally SRCU
instead of the DMAPI rwsem. Not more, not less:

  writing task                                      HSM service
  ------------                                      -----------
  sb_start_write_srcu() --> notify change intent -->
                                                    HSM record to changes db
                        <-- ack change intent recorded <--
  ... make in-core changes ...
                        <-- wait for changes in-flight <-- sb_write_barrier()
  sb_end_write_srcu()   --> ack changes in-flight -->
                        <-- persist recorded changes <-- syncfs()
  persist in-core changes
                        --> ack persist changes -->
                                                    HSM notify change consumers

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 14+ messages in thread
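The SRCU semantics that the exchange above turns on can be modeled in user space. This is a minimal sketch of the grace-period idea, not the kernel's SRCU implementation, and all names are illustrative: writers enter and exit a read-side section without ever blocking, while the barrier flips an epoch and waits only for writers that entered before the flip, unlike the DMAPI rwsem (which blocked across syscalls) or fsfreeze (which blocks new writers):

```python
import threading

class WriteBarrier:
    """User-space model of the SRCU-style vfs write barrier.

    start_write()/end_write() model sb_start_write_srcu()/
    sb_end_write_srcu(): they never block. barrier() models
    sb_write_barrier(): it waits out one grace period, i.e. only
    the writers that were already in flight when it was called.
    (Like SRCU, this simple two-slot model assumes barriers are
    serialized by the caller.)
    """
    def __init__(self):
        self._cond = threading.Condition()
        self._epoch = 0
        self._active = [0, 0]   # in-flight writers per epoch slot

    def start_write(self) -> int:
        with self._cond:
            e = self._epoch & 1
            self._active[e] += 1
            return e            # writer remembers its epoch slot

    def end_write(self, e: int) -> None:
        with self._cond:
            self._active[e] -= 1
            self._cond.notify_all()

    def barrier(self) -> None:
        with self._cond:
            old = self._epoch & 1
            self._epoch += 1    # new writers land in the other slot
            while self._active[old]:
                self._cond.wait()
```

A waiter calling `barrier()` thus gets the guarantee discussed above: every change whose pre- event was recorded before the barrier has finished its in-core modification by the time the barrier returns, while concurrent new writers are never stalled.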
end of thread, other threads:[~2025-03-27 18:23 UTC | newest]

Thread overview: 14+ messages -- links below jump to the message on this page:
2025-01-17 18:01 [LSF/MM/BPF TOPIC] vfs write barriers Amir Goldstein
2025-01-19 21:15 ` Dave Chinner
2025-01-20 11:41   ` Amir Goldstein
2025-01-23  0:34     ` Dave Chinner
2025-01-23 14:01       ` Amir Goldstein
2025-01-23 18:14         ` Jeff Layton
2025-01-24 21:07           ` Amir Goldstein
2025-02-11 14:53             ` Jan Kara
2025-03-20 17:00               ` Amir Goldstein
2025-03-27 18:23                 ` Amir Goldstein
2025-01-27 23:34   ` Dave Chinner
2025-01-29  1:39     ` Amir Goldstein
2025-02-11 21:12       ` Dave Chinner
2025-02-12  8:29         ` Amir Goldstein