* replacement i_version counter for xfs @ 2023-01-24 12:56 Jeff Layton 2023-01-25 0:02 ` Dave Chinner 0 siblings, 1 reply; 13+ messages in thread From: Jeff Layton @ 2023-01-24 12:56 UTC (permalink / raw) To: linux-xfs; +Cc: linux-fsdevel A few months ago, I posted a patch to make xfs not bump its i_version counter on atime updates. Dave Chinner NAK'ed that patch, mentioning that xfs would need to replace it with an entirely new field as the existing counter is used for other purposes and its semantics are set in stone. Has anything been done toward that end? Should I file a bug report or something? It would be nice to get this going so that XFS can continue to be a viable filesystem for exporting via NFS. Thanks, -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: replacement i_version counter for xfs 2023-01-24 12:56 replacement i_version counter for xfs Jeff Layton @ 2023-01-25 0:02 ` Dave Chinner 2023-01-25 11:47 ` Jeff Layton 0 siblings, 1 reply; 13+ messages in thread From: Dave Chinner @ 2023-01-25 0:02 UTC (permalink / raw) To: Jeff Layton; +Cc: linux-xfs, linux-fsdevel On Tue, Jan 24, 2023 at 07:56:09AM -0500, Jeff Layton wrote: > A few months ago, I posted a patch to make xfs not bump its i_version > counter on atime updates. Dave Chinner NAK'ed that patch, mentioning > that xfs would need to replace it with an entirely new field as the > existing counter is used for other purposes and its semantics are set in > stone. > > Has anything been done toward that end? No, because we don't have official specification of the behaviour the nfsd subsystem requires merged into the kernel yet. > Should I file a bug report or something? There's nothing we can really do until the new specification is set in stone. Filing a bug report won't change anything material. As it is, I'm guessing that you desire the behaviour to be as you described in the iversion patchset you just posted. That is effectively: * The change attribute (i_version) is mandated by NFSv4 and is mostly for * knfsd, but is also used for other purposes (e.g. IMA). The i_version must - * appear different to observers if there was a change to the inode's data or - * metadata since it was last queried. + * appear larger to observers if there was an explicit change to the inode's + * data or metadata since it was last queried. i.e. the definition is changing from *any* metadata or data change to *explicit* metadata/data changes, right? i.e. it should only change when ctime changes? IIUC the rest of the justification for i_version is that ctime might lack the timestamp granularity to disambiguate sub-timestamp granularity changes, so i_version is needed to bridge that gap. Given that XFS has nanosecond timestamp resolution in the on-disk format, both i_version and ctime changes are journalled, and ctime/i_version will always change at exactly the same time in the same transactions, there are no inherent sub-timestamp granularity problems with ctime within XFS. Any deficiency in ctime resolution comes solely from the granularity of the VFS inode timestamp functions. And so if current_time() was to provide fine-grained nanosecond timestamp resolution for exported XFS filesystems (i.e. use ktime_get_real_ts64() conditionally), then it seems to me that the nfsd i_version function becomes completely redundant. i.e. we are pretty much guaranteed that ctime on exported filesystems will always be different for explicit modifications to the same inode, and hence we can just use ctime as the version change identifier without needing any on-disk format changes at all. And we can optimise away that overhead when the filesystem is not exported by just using the coarse timestamps because there is no need for sub-timer-tick disambiguation of single file modifications.... Hence it appears to me that with the new i_version specification that there's an avenue out of this problem entirely that is "nfsd needs to use ctime, not i_version". This solution seems generic enough that filesystems with existing on-disk nanosecond timestamp granularity would no longer need explicit on-disk support for the nfsd i_version functionality, yes? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
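A minimal sketch of the conditional timestamp source Dave describes above, assuming a hypothetical per-superblock flag (SB_FINE_GRAINED_TS does not exist upstream; the timekeeping and truncation helpers are the existing kernel interfaces being referred to):

#include <linux/fs.h>
#include <linux/timekeeping.h>

/* hypothetical flag: "this superblock wants fine-grained ctime" */
#define SB_FINE_GRAINED_TS	(1UL << 30)

static struct timespec64 sketch_current_time(struct inode *inode)
{
	struct timespec64 now;

	if (inode->i_sb->s_flags & SB_FINE_GRAINED_TS)
		ktime_get_real_ts64(&now);		/* full clocksource read */
	else
		ktime_get_coarse_real_ts64(&now);	/* last timer tick, cheap */

	/* clamp to the granularity the filesystem can store on disk */
	return timestamp_truncate(now, inode);
}

Whatever policy decides "exported or not" would be the thing that sets or clears such a flag; the clock selection itself is the cheap part.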
* Re: replacement i_version counter for xfs 2023-01-25 0:02 ` Dave Chinner @ 2023-01-25 11:47 ` Jeff Layton 2023-01-25 16:32 ` Darrick J. Wong 2023-01-30 1:54 ` Dave Chinner 0 siblings, 2 replies; 13+ messages in thread From: Jeff Layton @ 2023-01-25 11:47 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel On Wed, 2023-01-25 at 11:02 +1100, Dave Chinner wrote: > On Tue, Jan 24, 2023 at 07:56:09AM -0500, Jeff Layton wrote: > > A few months ago, I posted a patch to make xfs not bump its i_version > > counter on atime updates. Dave Chinner NAK'ed that patch, mentioning > > that xfs would need to replace it with an entirely new field as the > > existing counter is used for other purposes and its semantics are set in > > stone. > > > > Has anything been done toward that end? > > No, because we don't have official specification of the behaviour > the nfsd subsystem requires merged into the kernel yet. > Ok. Hopefully that will be addressed in v6.3. > > Should I file a bug report or something? > > There's nothing we can really do until the new specification is set > in stone. Filing a bug report won't change anything material. > > As it is, I'm guessing that you desire the behaviour to be as you > described in the iversion patchset you just posted. That is > effectively: > > * The change attribute (i_version) is mandated by NFSv4 and is mostly for > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must > - * appear different to observers if there was a change to the inode's data or > - * metadata since it was last queried. > + * appear larger to observers if there was an explicit change to the inode's > + * data or metadata since it was last queried. > > i.e. the definition is changing from *any* metadata or data change > to *explicit* metadata/data changes, right? i.e. it should only > change when ctime changes? > Yes. > IIUC the rest of the justification for i_version is that ctime might > lack the timestamp granularity to disambiguate sub-timestamp > granularity changes, so i_version is needed to bridge that gap. > > Given that XFS has nanosecond timestamp resolution in the on-disk > format, both i_version and ctime changes are journalled, and > ctime/i_version will always change at exactly the same time in the > same transactions, there are no inherent sub-timestamp granularity > problems with ctime within XFS. Any deficiency in ctime resolution > comes solely from the granularity of the VFS inode timestamp > functions. > > And so if current_time() was to provide fine-grained nanosecond > timestamp resolution for exported XFS filesystems (i.e. use > ktime_get_real_ts64() conditionally), then it seems to me that the > nfsd i_version function becomes completely redundant. > > i.e. we are pretty much guaranteed that ctime on exported > filesystems will always be different for explicit modifications to > the same inode, and hence we can just use ctime as the version > change identifier without needing any on-disk format changes at all. > > And we can optimise away that overhead when the filesystem is not > exported by just using the coarse timestamps because there is no > need for sub-timer-tick disambiguation of single file > modifications.... > Ok, so conditional on (maybe) a per fstype flag, and whether the filesystem is exported? It's not trivial to tell whether something is exported though. We typically only do that sort of checking within nfsd. That involves an upcall into mountd, at a minimum. 
I don't think you want to be plumbing calls to exportfs into xfs for this. It may be simpler to just add a new on-disk counter and be done with it. > Hence it appears to me that with the new i_version specification > that there's an avenue out of this problem entirely that is "nfsd > needs to use ctime, not i_version". This solution seems generic > enough that filesystems with existing on-disk nanosecond timestamp > granularity would no longer need explicit on-disk support for the > nfsd i_version functionality, yes? > Pretty much. My understanding has always been that it's not the on-disk format that's the limiting factor, but the resolution of in-kernel timestamp sources. If ktime_get_real_ts64 has real ns granularity, then that should be sufficient (at least for the moment). I'm unclear on the performance implications with such a change though. You had also mentioned a while back that there was some desire for femtosecond resolution on timestamps. Does that change the calculus here at all? Note that the i_version is not subject to any timestamp granularity issues. If you want nfsd to start using the ctime for i_version with xfs, then you can just turn off the SB_I_IVERSION flag. You will need to do some work though to keep your "special" i_version that also counts atime updates working once you turn that off. You'll probably want to do that anyway though since the semantics for xfs's version counter are different from everyone else's. If this is what you choose to do for xfs, then the question becomes: who is going to do that timestamp rework? Note that there are two other lingering issues with i_version. Neither of these are xfs-specific, but they may inform the changes you want to make there: 1/ the ctime and i_version can roll backward on a crash. 2/ the ctime and i_version are both currently updated before write data is copied to the pagecache. It would be ideal if that were done afterward instead. (FWIW, I have some draft patches for btrfs and ext4 for this, but they need a lot more testing.) -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 13+ messages in thread
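For reference, if nfsd did fall back to ctime on such a filesystem, the change attribute would be synthesized from the timestamp along these lines (the packing scheme shown is illustrative only, not nfsd's actual encoding):

#include <linux/stat.h>
#include <linux/types.h>

/* 30 bits comfortably holds 0..999999999 nanoseconds */
static u64 sketch_ctime_change_attr(const struct kstat *stat)
{
	return ((u64)stat->ctime.tv_sec << 30) | stat->ctime.tv_nsec;
}

Any encoding that grows with each distinct ctime would do; the open question in the thread is whether ctime itself is guaranteed to move on every explicit change.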
* Re: replacement i_version counter for xfs 2023-01-25 11:47 ` Jeff Layton @ 2023-01-25 16:32 ` Darrick J. Wong 2023-01-25 17:58 ` Jeff Layton 2023-01-30 1:54 ` Dave Chinner 1 sibling, 1 reply; 13+ messages in thread From: Darrick J. Wong @ 2023-01-25 16:32 UTC (permalink / raw) To: Jeff Layton; +Cc: Dave Chinner, linux-xfs, linux-fsdevel On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote: > On Wed, 2023-01-25 at 11:02 +1100, Dave Chinner wrote: > > On Tue, Jan 24, 2023 at 07:56:09AM -0500, Jeff Layton wrote: > > > A few months ago, I posted a patch to make xfs not bump its i_version > > > counter on atime updates. Dave Chinner NAK'ed that patch, mentioning > > > that xfs would need to replace it with an entirely new field as the > > > existing counter is used for other purposes and its semantics are set in > > > stone. > > > > > > Has anything been done toward that end? > > > > No, because we don't have official specification of the behaviour > > the nfsd subsystem requires merged into the kernel yet. > > > > Ok. Hopefully that will be addressed in v6.3. > > > > Should I file a bug report or something? > > > > There's nothing we can really do until the new specification is set > > in stone. Filing a bug report won't change anything material. > > > > As it is, I'm guessing that you desire the behaviour to be as you > > described in the iversion patchset you just posted. That is > > effectively: > > > > * The change attribute (i_version) is mandated by NFSv4 and is mostly for > > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must > > - * appear different to observers if there was a change to the inode's data or > > - * metadata since it was last queried. > > + * appear larger to observers if there was an explicit change to the inode's > > + * data or metadata since it was last queried. > > > > i.e. the definition is changing from *any* metadata or data change > > to *explicit* metadata/data changes, right? i.e. it should only > > change when ctime changes? > > > > Yes. > > > IIUC the rest of the justification for i_version is that ctime might > > lack the timestamp granularity to disambiguate sub-timestamp > > granularity changes, so i_version is needed to bridge that gap. > > > > Given that XFS has nanosecond timestamp resolution in the on-disk > > format, both i_version and ctime changes are journalled, and > > ctime/i_version will always change at exactly the same time in the > > same transactions, there are no inherent sub-timestamp granularity > > problems with ctime within XFS. Any deficiency in ctime resolution > > comes solely from the granularity of the VFS inode timestamp > > functions. > > > > And so if current_time() was to provide fine-grained nanosecond > > timestamp resolution for exported XFS filesystems (i.e. use > > ktime_get_real_ts64() conditionally), then it seems to me that the > > nfsd i_version function becomes completely redundant. > > > > i.e. we are pretty much guaranteed that ctime on exported > > filesystems will always be different for explicit modifications to > > the same inode, and hence we can just use ctime as the version > > change identifier without needing any on-disk format changes at all. > > > > And we can optimise away that overhead when the filesystem is not > > exported by just using the coarse timestamps because there is no > > need for sub-timer-tick disambiguation of single file > > modifications.... > > > > Ok, so conditional on (maybe) a per fstype flag, and whether the > filesystem is exported? 
> > It's not trivial to tell whether something is exported though. We > typically only do that sort of checking within nfsd. That involves an > upcall into mountd, at a minimum. > > I don't think you want to be plumbing calls to exportfs into xfs for > this. It may be simpler to just add a new on-disk counter and be done > with it. Simpler for you, maybe. Ondisk format changes are a PITA to evaluate and come with a long support burden. We'd also have to write xfs-specific testcases to ensure that the counter updates according to specification. Poking the kernel to provide sub-jiffies timestamp granularity when required stays within the existing ondisk format, can be added to any filesystem with sufficient timestamp granularity, and can be the subject of a generic/ vfs test. I also wonder if it's even necessary to use ktime_get_real_ts64 in all cases -- can we sample the coarse granularity timestamp, and only go for the higher resolution one if the first matches the ctime? > > Hence it appears to me that with the new i_version specification > > that there's an avenue out of this problem entirely that is "nfsd > > needs to use ctime, not i_version". This solution seems generic > > enough that filesystems with existing on-disk nanosecond timestamp > > granularity would no longer need explicit on-disk support for the > > nfsd i_version functionality, yes? > > > > Pretty much. > > My understanding has always been that it's not the on-disk format that's > the limiting factor, but the resolution of in-kernel timestamp sources. > If ktime_get_real_ts64 has real ns granularity, then that should be > sufficient (at least for the moment). I'm unclear on the performance > implications with such a change though. I bet you can find some arm board or something with a terrible clocksource that will take forever to produce high resolution timestamps and get it wrong. > You had also mentioned a while back that there was some desire for > femtosecond resolution on timestamps. Does that change the calculus here > at all? Note that the i_version is not subject to any timestamp > granularity issues. I personally don't care to go enlarge xfs timestamps even further to support sub-ns resolution, but I see the theoretical argument for needing them on an 8GHz Intel i9-13900KS(paceheater)... > If you want nfsd to start using the ctime for i_version with xfs, then > you can just turn off the SB_I_IVERSION flag. You will need to do some > work though to keep your "special" i_version that also counts atime > updates working once you turn that off. You'll probably want to do that > anyway though since the semantics for xfs's version counter are > different from everyone else's. > > If this is what you choose to do for xfs, then the question becomes: who > is going to do that timestamp rework? > > Note that there are two other lingering issues with i_version. Neither > of these are xfs-specific, but they may inform the changes you want to > make there: > > 1/ the ctime and i_version can roll backward on a crash. > > 2/ the ctime and i_version are both currently updated before write data > is copied to the pagecache. It would be ideal if that were done > afterward instead. (FWIW, I have some draft patches for btrfs and ext4 > for this, but they need a lot more testing.) You might also want some means for xfs to tell the vfs that it already did the timestamp update (because, say, we had to allocate blocks). 
I wonder what people will say when we have to run a transaction before the write to peel off suid bits and another one after to update ctime. --D > > -- > Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 13+ messages in thread
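A sketch of the sampling order Darrick suggests above -- take the cheap coarse stamp first and only pay for a full clocksource read when the coarse value would not move ctime forward (illustrative, not proposed code; the helpers are existing kernel APIs):

#include <linux/fs.h>
#include <linux/timekeeping.h>

static struct timespec64 sketch_ctime_for_update(struct inode *inode)
{
	struct timespec64 now;

	ktime_get_coarse_real_ts64(&now);
	if (timespec64_compare(&now, &inode->i_ctime) > 0)
		return timestamp_truncate(now, inode);

	/* coarse clock hasn't advanced past the current ctime: disambiguate */
	ktime_get_real_ts64(&now);
	return timestamp_truncate(now, inode);
}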
* Re: replacement i_version counter for xfs 2023-01-25 16:32 ` Darrick J. Wong @ 2023-01-25 17:58 ` Jeff Layton 2023-01-30 2:05 ` Dave Chinner 0 siblings, 1 reply; 13+ messages in thread From: Jeff Layton @ 2023-01-25 17:58 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, linux-fsdevel On Wed, 2023-01-25 at 08:32 -0800, Darrick J. Wong wrote: > On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote: > > On Wed, 2023-01-25 at 11:02 +1100, Dave Chinner wrote: > > > On Tue, Jan 24, 2023 at 07:56:09AM -0500, Jeff Layton wrote: > > > > A few months ago, I posted a patch to make xfs not bump its i_version > > > > counter on atime updates. Dave Chinner NAK'ed that patch, mentioning > > > > that xfs would need to replace it with an entirely new field as the > > > > existing counter is used for other purposes and its semantics are set in > > > > stone. > > > > > > > > Has anything been done toward that end? > > > > > > No, because we don't have official specification of the behaviour > > > the nfsd subsystem requires merged into the kernel yet. > > > > > > > Ok. Hopefully that will be addressed in v6.3. > > > > > > Should I file a bug report or something? > > > > > > There's nothing we can really do until the new specification is set > > > in stone. Filing a bug report won't change anything material. > > > > > > As it is, I'm guessing that you desire the behaviour to be as you > > > described in the iversion patchset you just posted. That is > > > effectively: > > > > > > * The change attribute (i_version) is mandated by NFSv4 and is mostly for > > > * knfsd, but is also used for other purposes (e.g. IMA). The i_version must > > > - * appear different to observers if there was a change to the inode's data or > > > - * metadata since it was last queried. > > > + * appear larger to observers if there was an explicit change to the inode's > > > + * data or metadata since it was last queried. > > > > > > i.e. the definition is changing from *any* metadata or data change > > > to *explicit* metadata/data changes, right? i.e. it should only > > > change when ctime changes? > > > > > > > Yes. > > > > > IIUC the rest of the justification for i_version is that ctime might > > > lack the timestamp granularity to disambiguate sub-timestamp > > > granularity changes, so i_version is needed to bridge that gap. > > > > > > Given that XFS has nanosecond timestamp resolution in the on-disk > > > format, both i_version and ctime changes are journalled, and > > > ctime/i_version will always change at exactly the same time in the > > > same transactions, there are no inherent sub-timestamp granularity > > > problems with ctime within XFS. Any deficiency in ctime resolution > > > comes solely from the granularity of the VFS inode timestamp > > > functions. > > > > > > And so if current_time() was to provide fine-grained nanosecond > > > timestamp resolution for exported XFS filesystems (i.e. use > > > ktime_get_real_ts64() conditionally), then it seems to me that the > > > nfsd i_version function becomes completely redundant. > > > > > > i.e. we are pretty much guaranteed that ctime on exported > > > filesystems will always be different for explicit modifications to > > > the same inode, and hence we can just use ctime as the version > > > change identifier without needing any on-disk format changes at all. 
> > > > > > And we can optimise away that overhead when the filesystem is not > > > exported by just using the coarse timestamps because there is no > > > need for sub-timer-tick disambiguation of single file > > > modifications.... > > > > > > > Ok, so conditional on (maybe) a per fstype flag, and whether the > > filesystem is exported? > > > > It's not trivial to tell whether something is exported though. We > > typically only do that sort of checking within nfsd. That involves an > > upcall into mountd, at a minimum. > > > > I don't think you want to be plumbing calls to exportfs into xfs for > > this. It may be simpler to just add a new on-disk counter and be done > > with it. > > Simpler for you, maybe. Ondisk format changes are a PITA to evaluate > and come with a long support burden. We'd also have to write > xfs-specific testcases to ensure that the counter updates according to > specification. > > Poking the kernel to provide sub-jiffies timestamp granularity when > required stays within the existing ondisk format, can be added to any > filesystem with sufficient timestamp granularity, and can be the subject > of a generic/ vfs test. > > I also wonder if it's even necessary to use ktime_get_real_ts64 in all > cases -- can we sample the coarse granularity timestamp, and only go for > the higher resolution one if the first matches the ctime? > > > > Hence it appears to me that with the new i_version specification > > > that there's an avenue out of this problem entirely that is "nfsd > > > needs to use ctime, not i_version". This solution seems generic > > > enough that filesystems with existing on-disk nanosecond timestamp > > > granularity would no longer need explicit on-disk support for the > > > nfsd i_version functionality, yes? > > > > > > > Pretty much. > > > > My understanding has always been that it's not the on-disk format that's > > the limiting factor, but the resolution of in-kernel timestamp sources. > > If ktime_get_real_ts64 has real ns granularity, then that should be > > sufficient (at least for the moment). I'm unclear on the performance > > implications with such a change though. > > I bet you can find some arm board or something with a terrible > clocksource that will take forever to produce high resolution timestamps > and get it wrong. > > > You had also mentioned a while back that there was some desire for > > femtosecond resolution on timestamps. Does that change the calculus here > > at all? Note that the i_version is not subject to any timestamp > > granularity issues. > > I personally don't care to go enlarge xfs timestamps even further to > support sub-ns resolution, but I see the theoretical argument for > needing them on an 8GHz Intel i9-13900KS(paceheater)... > > > If you want nfsd to start using the ctime for i_version with xfs, then > > you can just turn off the SB_I_IVERSION flag. You will need to do some > > work though to keep your "special" i_version that also counts atime > > updates working once you turn that off. You'll probably want to do that > > anyway though since the semantics for xfs's version counter are > > different from everyone else's. > > > > If this is what you choose to do for xfs, then the question becomes: who > > is going to do that timestamp rework? > > > > Note that there are two other lingering issues with i_version. Neither > > of these are xfs-specific, but they may inform the changes you want to > > make there: > > > > 1/ the ctime and i_version can roll backward on a crash. 
> > > > 2/ the ctime and i_version are both currently updated before write data > > is copied to the pagecache. It would be ideal if that were done > > afterward instead. (FWIW, I have some draft patches for btrfs and ext4 > > for this, but they need a lot more testing.) > > You might also want some means for xfs to tell the vfs that it already > did the timestamp update (because, say, we had to allocate blocks). > I wonder what people will say when we have to run a transaction before > the write to peel off suid bits and another one after to update ctime. > That's a great question! There is a related one too once I started looking at this in more detail: Most filesystems end up updating the timestamp via the call to file_update_time in __generic_file_write_iter. Today, that's called very early in the function and if it fails, the write fails without changing anything. What do we do now if the write succeeds, but update_time fails? We don't want to return an error on the write() since the data did get copied in. Ignoring it seems wrong too though. There could even be some way to exploit that by changing the contents while holding the timestamp and version constant. At this point I'm leaning toward leaving the ctime and i_version to be updated before the write, and just bumping the i_version a second time after. In most cases the second bump will end up being a no-op, unless an i_version query races in between. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 13+ messages in thread
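What Jeff is leaning towards would look roughly like this in a generic buffered write path (a sketch, not the real __generic_file_write_iter(); the helpers shown are the existing ones):

#include <linux/fs.h>
#include <linux/iversion.h>
#include <linux/uio.h>

static ssize_t sketch_buffered_write(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	/* ctime + i_version bumped before the copy, as today */
	ret = file_update_time(iocb->ki_filp);
	if (ret)
		return ret;

	ret = generic_perform_write(iocb, from);

	/* second bump: a no-op unless i_version was queried during the copy */
	if (ret > 0)
		inode_maybe_inc_iversion(inode, false);

	return ret;
}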
* Re: replacement i_version counter for xfs 2023-01-25 17:58 ` Jeff Layton @ 2023-01-30 2:05 ` Dave Chinner 2023-01-31 12:02 ` Jeff Layton 0 siblings, 1 reply; 13+ messages in thread From: Dave Chinner @ 2023-01-30 2:05 UTC (permalink / raw) To: Jeff Layton; +Cc: Darrick J. Wong, linux-xfs, linux-fsdevel On Wed, Jan 25, 2023 at 12:58:08PM -0500, Jeff Layton wrote: > On Wed, 2023-01-25 at 08:32 -0800, Darrick J. Wong wrote: > > On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote: > > > Note that there are two other lingering issues with i_version. Neither > > > of these are xfs-specific, but they may inform the changes you want to > > > make there: > > > > > > 1/ the ctime and i_version can roll backward on a crash. > > > > > > 2/ the ctime and i_version are both currently updated before write data > > > is copied to the pagecache. It would be ideal if that were done > > > afterward instead. (FWIW, I have some draft patches for btrfs and ext4 > > > for this, but they need a lot more testing.) > > > > You might also want some means for xfs to tell the vfs that it already > > did the timestamp update (because, say, we had to allocate blocks). > > I wonder what people will say when we have to run a transaction before > > the write to peel off suid bits and another one after to update ctime. > > > > That's a great question! There is a related one too once I started > looking at this in more detail: > > Most filesystems end up updating the timestamp via a the call to > file_update_time in __generic_file_write_iter. Today, that's called very > early in the function and if it fails, the write fails without changing > anything. > > What do we do now if the write succeeds, but update_time fails? We don't On XFS, the timestamp update will either succeed or cause the filesystem to shutdown as a failure with a dirty transaction is a fatal, unrecoverable error. > want to return an error on the write() since the data did get copied in. > Ignoring it seems wrong too though. There could even be some way to > exploit that by changing the contents while holding the timestamp and > version constant. If the filesystem has shut down, it doesn't matter that the data got copied into the kernel - it's never going to make it to disk and attempts to read it back will also fail. There's nothing that can be exploited by such a failure on XFS - it's game over for everyone once the fs has shut down.... > At this point I'm leaning toward leaving the ctime and i_version to be > updated before the write, and just bumping the i_version a second time > after. In most cases the second bump will end up being a no-op, unless > an i_version query races in between. Why not also bump ctime at write completion if a query races with the write()? Wouldn't that put ns-granularity ctime based change detection on a par with i_version? Userspace isn't going to notice the difference - the ctime they observe indicates that it was changed during the syscall. So who/what is going to care if we bump ctime twice in the syscall instead of just once in this rare corner case? Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: replacement i_version counter for xfs 2023-01-30 2:05 ` Dave Chinner @ 2023-01-31 12:02 ` Jeff Layton 2023-01-31 23:31 ` Dave Chinner 0 siblings, 1 reply; 13+ messages in thread From: Jeff Layton @ 2023-01-31 12:02 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs, linux-fsdevel On Mon, 2023-01-30 at 13:05 +1100, Dave Chinner wrote: > On Wed, Jan 25, 2023 at 12:58:08PM -0500, Jeff Layton wrote: > > On Wed, 2023-01-25 at 08:32 -0800, Darrick J. Wong wrote: > > > On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote: > > > > Note that there are two other lingering issues with i_version. Neither > > > > of these are xfs-specific, but they may inform the changes you want to > > > > make there: > > > > > > > > 1/ the ctime and i_version can roll backward on a crash. > > > > > > > > 2/ the ctime and i_version are both currently updated before write data > > > > is copied to the pagecache. It would be ideal if that were done > > > > afterward instead. (FWIW, I have some draft patches for btrfs and ext4 > > > > for this, but they need a lot more testing.) > > > > > > You might also want some means for xfs to tell the vfs that it already > > > did the timestamp update (because, say, we had to allocate blocks). > > > I wonder what people will say when we have to run a transaction before > > > the write to peel off suid bits and another one after to update ctime. > > > > > > > That's a great question! There is a related one too once I started > > looking at this in more detail: > > > > Most filesystems end up updating the timestamp via a the call to > > file_update_time in __generic_file_write_iter. Today, that's called very > > early in the function and if it fails, the write fails without changing > > anything. > > > > What do we do now if the write succeeds, but update_time fails? We don't > > On XFS, the timestamp update will either succeed or cause the > filesystem to shutdown as a failure with a dirty transaction is a > fatal, unrecoverable error. > Ok. So for xfs, we could move all of this to be afterward. Clearing setuid bits is quite rare, so that would only rarely require a transaction (in principle). > > want to return an error on the write() since the data did get copied in. > > Ignoring it seems wrong too though. There could even be some way to > > exploit that by changing the contents while holding the timestamp and > > version constant. > > If the filesystem has shut down, it doesn't matter that the data got > copied into the kernel - it's never going to make it to disk and > attempts to read it back will also fail. There's nothing that can be > exploited by such a failure on XFS - it's game over for everyone > once the fs has shut down.... > > > At this point I'm leaning toward leaving the ctime and i_version to be > > updated before the write, and just bumping the i_version a second time > > after. In most cases the second bump will end up being a no-op, unless > > an i_version query races in between. > > Why not also bump ctime at write completion if a query races with > the write()? Wouldn't that put ns-granularity ctime based change > detection on a par with i_version? > > Userspace isn't going to notice the difference - the ctime they > observe indicates that it was changed during the syscall. So > who/what is going to care if we bump ctime twice in the syscall > instead of just once in this rare corner case? > We could bump the ctime too in this situation, but it would be more costly. In most cases the i_version bump will be a no-op. 
The only exception would be when a query of i_version races in between the two bumps. That wouldn't be the case with the ctime, which would almost always require a second transaction. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 13+ messages in thread
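The reason the second i_version bump is normally free is the "queried" bit kept in the low bit of the counter; a stripped-down model of the logic in include/linux/iversion.h (the real code does this with an atomic64 cmpxchg loop):

#define SKETCH_QUERIED		1ULL	/* low bit: someone read the value */
#define SKETCH_INCREMENT	2ULL	/* counter lives in the upper 63 bits */

/* bump only if the current value has been observed since the last bump */
static bool sketch_maybe_inc(u64 *iv)
{
	if (!(*iv & SKETCH_QUERIED))
		return false;
	*iv = (*iv & ~SKETCH_QUERIED) + SKETCH_INCREMENT;
	return true;
}

/* reading the value arms the next bump */
static u64 sketch_query(u64 *iv)
{
	*iv |= SKETCH_QUERIED;
	return *iv >> 1;
}

A second ctime update has no such "was it observed" shortcut today, which is the extra transaction Jeff is pointing at.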
* Re: replacement i_version counter for xfs 2023-01-31 12:02 ` Jeff Layton @ 2023-01-31 23:31 ` Dave Chinner 2023-02-01 19:19 ` Jeff Layton 0 siblings, 1 reply; 13+ messages in thread From: Dave Chinner @ 2023-01-31 23:31 UTC (permalink / raw) To: Jeff Layton; +Cc: Darrick J. Wong, linux-xfs, linux-fsdevel On Tue, Jan 31, 2023 at 07:02:56AM -0500, Jeff Layton wrote: > On Mon, 2023-01-30 at 13:05 +1100, Dave Chinner wrote: > > On Wed, Jan 25, 2023 at 12:58:08PM -0500, Jeff Layton wrote: > > > On Wed, 2023-01-25 at 08:32 -0800, Darrick J. Wong wrote: > > > > On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote: > > > > > Note that there are two other lingering issues with i_version. Neither > > > > > of these are xfs-specific, but they may inform the changes you want to > > > > > make there: > > > > > > > > > > 1/ the ctime and i_version can roll backward on a crash. > > > > > > > > > > 2/ the ctime and i_version are both currently updated before write data > > > > > is copied to the pagecache. It would be ideal if that were done > > > > > afterward instead. (FWIW, I have some draft patches for btrfs and ext4 > > > > > for this, but they need a lot more testing.) > > > > > > > > You might also want some means for xfs to tell the vfs that it already > > > > did the timestamp update (because, say, we had to allocate blocks). > > > > I wonder what people will say when we have to run a transaction before > > > > the write to peel off suid bits and another one after to update ctime. > > > > > > > > > > That's a great question! There is a related one too once I started > > > looking at this in more detail: > > > > > > Most filesystems end up updating the timestamp via a the call to > > > file_update_time in __generic_file_write_iter. Today, that's called very > > > early in the function and if it fails, the write fails without changing > > > anything. > > > > > > What do we do now if the write succeeds, but update_time fails? We don't > > > > On XFS, the timestamp update will either succeed or cause the > > filesystem to shutdown as a failure with a dirty transaction is a > > fatal, unrecoverable error. > > > > Ok. So for xfs, we could move all of this to be afterward. Clearing > setuid bits is quite rare, so that would only rarely require a > transaction (in principle). See my response in the other email about XFS and atomic buffered write IO. We don't need to do an update after the write because reads cannot race between the data copy and the ctime/i_version update. Hence we only need one update, and it doesn't matter if it is before or after the data copy into the page cache. > > > want to return an error on the write() since the data did get copied in. > > > Ignoring it seems wrong too though. There could even be some way to > > > exploit that by changing the contents while holding the timestamp and > > > version constant. > > > > If the filesystem has shut down, it doesn't matter that the data got > > copied into the kernel - it's never going to make it to disk and > > attempts to read it back will also fail. There's nothing that can be > > exploited by such a failure on XFS - it's game over for everyone > > once the fs has shut down.... > > > > > At this point I'm leaning toward leaving the ctime and i_version to be > > > updated before the write, and just bumping the i_version a second time > > > after. In most cases the second bump will end up being a no-op, unless > > > an i_version query races in between. 
> > > > Why not also bump ctime at write completion if a query races with > > the write()? Wouldn't that put ns-granularity ctime based change > > detection on a par with i_version? > > > > Userspace isn't going to notice the difference - the ctime they > > observe indicates that it was changed during the syscall. So > > who/what is going to care if we bump ctime twice in the syscall > > instead of just once in this rare corner case? > > > > We could bump the ctime too in this situation, but it would be more > costly. In most cases the i_version bump will be a no-op. The only > exception would be when a query of i_version races in between the two > bumps. That wouldn't be the case with the ctime, which would almost > always require a second transaction. You've missed the part where I suggested lifting the "nfsd sampled i_version" state into an inode state flag rather than hiding it in the i_version field. At that point, we could optimise away the secondary ctime updates just like you are proposing we do with the i_version updates. Further, we could also use that state to decide whether we need to use high resolution timestamps when recording ctime updates - if the nfsd has not sampled the ctime/i_version, we don't need high res timestamps to be recorded for ctime.... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: replacement i_version counter for xfs 2023-01-31 23:31 ` Dave Chinner @ 2023-02-01 19:19 ` Jeff Layton 0 siblings, 0 replies; 13+ messages in thread From: Jeff Layton @ 2023-02-01 19:19 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs, linux-fsdevel On Wed, 2023-02-01 at 10:31 +1100, Dave Chinner wrote: > On Tue, Jan 31, 2023 at 07:02:56AM -0500, Jeff Layton wrote: > > On Mon, 2023-01-30 at 13:05 +1100, Dave Chinner wrote: > > > On Wed, Jan 25, 2023 at 12:58:08PM -0500, Jeff Layton wrote: > > > > On Wed, 2023-01-25 at 08:32 -0800, Darrick J. Wong wrote: > > > > > On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote: > > > > > > Note that there are two other lingering issues with i_version. Neither > > > > > > of these are xfs-specific, but they may inform the changes you want to > > > > > > make there: > > > > > > > > > > > > 1/ the ctime and i_version can roll backward on a crash. > > > > > > > > > > > > 2/ the ctime and i_version are both currently updated before write data > > > > > > is copied to the pagecache. It would be ideal if that were done > > > > > > afterward instead. (FWIW, I have some draft patches for btrfs and ext4 > > > > > > for this, but they need a lot more testing.) > > > > > > > > > > You might also want some means for xfs to tell the vfs that it already > > > > > did the timestamp update (because, say, we had to allocate blocks). > > > > > I wonder what people will say when we have to run a transaction before > > > > > the write to peel off suid bits and another one after to update ctime. > > > > > > > > > > > > > That's a great question! There is a related one too once I started > > > > looking at this in more detail: > > > > > > > > Most filesystems end up updating the timestamp via a the call to > > > > file_update_time in __generic_file_write_iter. Today, that's called very > > > > early in the function and if it fails, the write fails without changing > > > > anything. > > > > > > > > What do we do now if the write succeeds, but update_time fails? We don't > > > > > > On XFS, the timestamp update will either succeed or cause the > > > filesystem to shutdown as a failure with a dirty transaction is a > > > fatal, unrecoverable error. > > > > > > > Ok. So for xfs, we could move all of this to be afterward. Clearing > > setuid bits is quite rare, so that would only rarely require a > > transaction (in principle). > > See my response in the other email about XFS and atomic buffered > write IO. We don't need to do an update after the write because > reads cannot race between the data copy and the ctime/i_version > update. Hence we only need one update, and it doesn't matter if it > is before or after the data copy into the page cache. > Yep, I just saw that. Makes sense. It sounds like we won't need to do anything extra for that for XFS at all. > > > > want to return an error on the write() since the data did get copied in. > > > > Ignoring it seems wrong too though. There could even be some way to > > > > exploit that by changing the contents while holding the timestamp and > > > > version constant. > > > > > > If the filesystem has shut down, it doesn't matter that the data got > > > copied into the kernel - it's never going to make it to disk and > > > attempts to read it back will also fail. There's nothing that can be > > > exploited by such a failure on XFS - it's game over for everyone > > > once the fs has shut down.... 
> > > > > > > At this point I'm leaning toward leaving the ctime and i_version to be > > > > updated before the write, and just bumping the i_version a second time > > > > after. In most cases the second bump will end up being a no-op, unless > > > > an i_version query races in between. > > > > > > Why not also bump ctime at write completion if a query races with > > > the write()? Wouldn't that put ns-granularity ctime based change > > > detection on a par with i_version? > > > > > > Userspace isn't going to notice the difference - the ctime they > > > observe indicates that it was changed during the syscall. So > > > who/what is going to care if we bump ctime twice in the syscall > > > instead of just once in this rare corner case? > > > > > > > We could bump the ctime too in this situation, but it would be more > > costly. In most cases the i_version bump will be a no-op. The only > > exception would be when a query of i_version races in between the two > > bumps. That wouldn't be the case with the ctime, which would almost > > always require a second transaction. > > You've missed the part where I suggested lifting the "nfsd sampled > i_version" state into an inode state flag rather than hiding it in > the i_version field. At that point, we could optimise away the > secondary ctime updates just like you are proposing we do with the > i_version updates. Further, we could also use that state it to > decide whether we need to use high resolution timestamps when > recording ctime updates - if the nfsd has not sampled the > ctime/i_version, we don't need high res timestamps to be recorded > for ctime.... Once you move the flag out of the word, we can no longer do this with atomic operations and will need to move to locking (probably a spinlock). Is it worth it? I'm not sure. It's an interesting proposal, regardless... -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 13+ messages in thread
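To illustrate Jeff's locking concern: once the flag lives in i_state rather than in the i_version word itself, the check-and-bump pair can no longer be a single cmpxchg and needs i_lock (or similar) to stay consistent. I_VERSION_SAMPLED below is an invented bit, purely for illustration:

#include <linux/fs.h>
#include <linux/iversion.h>

#define I_VERSION_SAMPLED	(1 << 30)	/* hypothetical i_state bit */

static bool sketch_maybe_inc_locked(struct inode *inode)
{
	bool bumped = false;

	spin_lock(&inode->i_lock);
	if (inode->i_state & I_VERSION_SAMPLED) {
		inode->i_state &= ~I_VERSION_SAMPLED;
		inode_inc_iversion_raw(inode);
		bumped = true;
	}
	spin_unlock(&inode->i_lock);

	return bumped;
}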
* Re: replacement i_version counter for xfs 2023-01-25 11:47 ` Jeff Layton 2023-01-25 16:32 ` Darrick J. Wong @ 2023-01-30 1:54 ` Dave Chinner 2023-01-31 11:55 ` Jeff Layton 1 sibling, 1 reply; 13+ messages in thread From: Dave Chinner @ 2023-01-30 1:54 UTC (permalink / raw) To: Jeff Layton; +Cc: linux-xfs, linux-fsdevel On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote: > On Wed, 2023-01-25 at 11:02 +1100, Dave Chinner wrote: > > IIUC the rest of the justification for i_version is that ctime might > > lack the timestamp granularity to disambiguate sub-timestamp > > granularity changes, so i_version is needed to bridge that gap. > > > > Given that XFS has nanosecond timestamp resolution in the on-disk > > format, both i_version and ctime changes are journalled, and > > ctime/i_version will always change at exactly the same time in the > > same transactions, there are no inherent sub-timestamp granularity > > problems with ctime within XFS. Any deficiency in ctime resolution > > comes solely from the granularity of the VFS inode timestamp > > functions. > > > > And so if current_time() was to provide fine-grained nanosecond > > timestamp resolution for exported XFS filesystems (i.e. use > > ktime_get_real_ts64() conditionally), then it seems to me that the > > nfsd i_version function becomes completely redundant. > > > > i.e. we are pretty much guaranteed that ctime on exported > > filesystems will always be different for explicit modifications to > > the same inode, and hence we can just use ctime as the version > > change identifier without needing any on-disk format changes at all. > > > > And we can optimise away that overhead when the filesystem is not > > exported by just using the coarse timestamps because there is no > > need for sub-timer-tick disambiguation of single file > > modifications.... > > > > Ok, so conditional on (maybe) a per fstype flag, and whether the > filesystem is exported? Not sure why a per-fstype flag is necessary? > > It's not trivial to tell whether something is exported though. We > typically only do that sort of checking within nfsd. That involves an > upcall into mountd, at a minimum. > > I don't think you want to be plumbing calls to exportfs into xfs for > this. It may be simpler to just add a new on-disk counter and be done > with it. I didn't ever expect for XFS to have to be aware of the fact that a user has exported the filesystem. If "filesystem has been exported" tracking is required, then we add a flag/counter to the superblock, and the NFSd subsystem updates the counter/flag when it is informed that part of the filesystem has been exported/unexported. The NFSd/export subsystem is pinning filesystems in memory when they are exported. This is evident by the fact we cannot unmount an exported filesystem - it has to be unexported before it can be unmounted. I suspect that it's the ex_path that is stored in the svc_export structure, because stuff like this is done in the filehandle code: static bool is_root_export(struct svc_export *exp) { return exp->ex_path.dentry == exp->ex_path.dentry->d_sb->s_root; } static struct super_block *exp_sb(struct svc_export *exp) { return exp->ex_path.dentry->d_sb; } i.e. the file handle code assumes the existence of a pinned path that is the root of the exported directory tree. This points to the superblock behind the export so that it can do stuff like pull the device numbers, check sb->s_type->fs_flags fields (e.g FS_REQUIRES_DEV), etc as needed to encode/decode/verify filehandles. 
Somewhere in the code this path must be pinned to the svc_export for the life of the svc_export (svc_export_init()?), and at that point the svc_export code could update the struct super_block state appropriately.... > > Hence it appears to me that with the new i_version specification > > that there's an avenue out of this problem entirely that is "nfsd > > needs to use ctime, not i_version". This solution seems generic > > enough that filesystems with existing on-disk nanosecond timestamp > > granularity would no longer need explicit on-disk support for the > > nfsd i_version functionality, yes? > > > > Pretty much. > > My understanding has always been that it's not the on-disk format that's > the limiting factor, but the resolution of in-kernel timestamp sources. > If ktime_get_real_ts64 has real ns granularity, then that should be > sufficient (at least for the moment). I'm unclear on the performance > implications with such a change though. It's indicated in the documentation: "These are only useful when called in a fast path and one still expects better than second accuracy, but can't easily use 'jiffies', e.g. for inode timestamps. Skipping the hardware clock access saves around 100 CPU cycles on most modern machines with a reliable cycle counter, but up to several microseconds on older hardware with an external clocksource." For a modern, high performance machine, 100 CPU cycles for the cycle counter read is less costly than a pipeline stall due to a single cacheline miss. For older, slower hardware, the transaction overhead of a ctime update is already in the order of microseconds, so this amount of overhead still isn't a show stopper. As it is, optimising the timestamp read similar to the the iversion bump only after it has been queried (i.e. equivalent of inode_maybe_inc_iversion()) would remove most of the additional overhead for non-NFSd operations. It could be done simply using an inode state flag rather than hiding the state bit in the i_version value... > You had also mentioned a while back that there was some desire for > femtosecond resolution on timestamps. Does that change the calculus here > at all? Note that the i_version is not subject to any timestamp > granularity issues. [ Reference from 2016 on femptosecond granularity timestamps in statx(): https://lore.kernel.org/linux-fsdevel/20161117234047.GE28177@dastard/ where I ask: "So what happens in ten years time when we want to support femptosecond resolution in the timestamp interface?" ] Timestamp granularity changes will require an on-disk format change regardless of anything else. We have no plans to do this again in the near future - we've just done a revision for >y2038 timestamp support in the on disk format, and we'd have to do another to support sub-nanosecond timestamp granularity. Hence we know exactly how much time, resources and testing needs to be put into changing the on-disk timestamp format. Given that adding a new i_version field is of similar complexity, we don't want to do either if we can possibly avoid it. Looking from a slightly higher perspective, in XFS timestamp updates are done under exclusive inode locks and so the entire transaction will need to be done in sub-nanosecond time before we need to worry about timestamp granularity. It's going to be a long, long time into the future before that ever happens (if ever!), so I don't think we need to change the on-disk timestamp granularity to disambiguate individual inode metadata/data changes any time soon. 
> If you want nfsd to start using the ctime for i_version with xfs, then > you can just turn off the SB_I_IVERSION flag. You will need to do some > work though to keep your "special" i_version that also counts atime > updates working once you turn that off. You'll probably want to do that > anyway though since the semantics for xfs's version counter are > different from everyone else's. XFS already uses ->update_time because it is different to other filesystems in that pure timestamp updates are always updated transactionally rather than just updating the inode in memory. I'm not sure there's anything we need to change there right now. Other things would need to change if we don't set SB_I_IVERSION - we'd unconditionally bump i_version in xfs_trans_log_inode() rather than forcing it through inode_maybe_inc_iversion() because we no longer want the VFS or applications to modify i_version. But we do still want to tell external parties that i_version is a usable counter that can be used for data and metadata change detection, and that stands regardless of the fact that the NFSd application doesn't want fine-grained change detection anymore. Hence I think whatever gets done needs to be more nuanced than "SB_I_VERSION really only means NFSD_CAN_USE_I_VERSION". Perhaps a second flag that says "SB_I_VERSION_NFSD_COMPATIBLE" to differentiate between the two cases? [ Reflection on terminology: how denigrating does it appear when something is called "special" because it is different to others? Such terminology would never be allowed if we were talking about differences between people. ] > If this is what you choose to do for xfs, then the question becomes: who > is going to do that timestamp rework? Depends on what ends up needing to be changed, I'm guessing... > Note that there are two other lingering issues with i_version. Neither > of these are xfs-specific, but they may inform the changes you want to > make there: > > 1/ the ctime and i_version can roll backward on a crash. Yup, because async transaction engines allow metadata to appear changed in memory before it is stable on disk. No difference between i_version or ctime here at all. > 2/ the ctime and i_version are both currently updated before write data > is copied to the pagecache. It would be ideal if that were done > afterward instead. (FWIW, I have some draft patches for btrfs and ext4 > for this, but they need a lot more testing.) AFAICT, changing the i_version after the page cache has been written to does not fix the visibility and/or crash ordering issue. The new data is published for third party access when the folio the data is written into is unlocked, not when the entire write operation completes. Hence we can start writeback on data that is part of a write operation before the write operation has completed and updated i_version. If we then crash before the write into the page cache completes and updates i_version, we can now have changed data on disk without a i_version update in the metadata to even recover from the journal. Hence with a post-write i_version update, none of the clients will see that their caches are stale because i_version/ctime hasn't changed despite the data on disk having changed. As such, I don't really see what "update i_version after page cache write completion" actually achieves by itself. Even "update i_version before and after write" doesn't entirely close down the crash hole in i_version, either - it narrows it down, but it does not remove it... Cheers, Dave. 
-- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
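The xfs_trans_log_inode() change Dave sketches above amounts to choosing between two bump flavours when the inode core is logged -- an illustrative fragment, not the actual XFS code:

#include <linux/iversion.h>

static void sketch_bump_on_log(struct inode *inode, bool vfs_owns_iversion)
{
	if (vfs_owns_iversion)
		/* with SB_I_VERSION: let the VFS rules decide (skip if never queried) */
		inode_maybe_inc_iversion(inode, false);
	else
		/* XFS-private counter: always moves when the inode is logged */
		inode_inc_iversion_raw(inode);
}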
* Re: replacement i_version counter for xfs 2023-01-30 1:54 ` Dave Chinner @ 2023-01-31 11:55 ` Jeff Layton 2023-01-31 23:23 ` Dave Chinner 0 siblings, 1 reply; 13+ messages in thread From: Jeff Layton @ 2023-01-31 11:55 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel On Mon, 2023-01-30 at 12:54 +1100, Dave Chinner wrote: > On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote: > > On Wed, 2023-01-25 at 11:02 +1100, Dave Chinner wrote: > > > IIUC the rest of the justification for i_version is that ctime might > > > lack the timestamp granularity to disambiguate sub-timestamp > > > granularity changes, so i_version is needed to bridge that gap. > > > > > > Given that XFS has nanosecond timestamp resolution in the on-disk > > > format, both i_version and ctime changes are journalled, and > > > ctime/i_version will always change at exactly the same time in the > > > same transactions, there are no inherent sub-timestamp granularity > > > problems with ctime within XFS. Any deficiency in ctime resolution > > > comes solely from the granularity of the VFS inode timestamp > > > functions. > > > > > > And so if current_time() was to provide fine-grained nanosecond > > > timestamp resolution for exported XFS filesystems (i.e. use > > > ktime_get_real_ts64() conditionally), then it seems to me that the > > > nfsd i_version function becomes completely redundant. > > > > > > i.e. we are pretty much guaranteed that ctime on exported > > > filesystems will always be different for explicit modifications to > > > the same inode, and hence we can just use ctime as the version > > > change identifier without needing any on-disk format changes at all. > > > > > > And we can optimise away that overhead when the filesystem is not > > > exported by just using the coarse timestamps because there is no > > > need for sub-timer-tick disambiguation of single file > > > modifications.... > > > > > > > Ok, so conditional on (maybe) a per fstype flag, and whether the > > filesystem is exported? > > Not sure why a per-fstype flag is necessary? > I was thinking most filesystems wouldn't need these high-res timestamps, so we could limit it to those that opt in via a fstype flag. > > > > It's not trivial to tell whether something is exported though. We > > typically only do that sort of checking within nfsd. That involves an > > upcall into mountd, at a minimum. > > > > I don't think you want to be plumbing calls to exportfs into xfs for > > this. It may be simpler to just add a new on-disk counter and be done > > with it. > > I didn't ever expect for XFS to have to be aware of the fact that a > user has exported the filesystem. If "filesystem has been exported" > tracking is required, then we add a flag/counter to the superblock, > and the NFSd subsystem updates the counter/flag when it is informed > that part of the filesystem has been exported/unexported. > > The NFSd/export subsystem is pinning filesystems in memory when they > are exported. This is evident by the fact we cannot unmount an > exported filesystem - it has to be unexported before it can be > unmounted. I suspect that it's the ex_path that is stored in the > svc_export structure, because stuff like this is done in the > filehandle code: > > static bool is_root_export(struct svc_export *exp) > { > return exp->ex_path.dentry == exp->ex_path.dentry->d_sb->s_root; > } > > static struct super_block *exp_sb(struct svc_export *exp) > { > return exp->ex_path.dentry->d_sb; > } > > i.e. 
the file handle code assumes the existence of a pinned path > that is the root of the exported directory tree. This points to the > superblock behind the export so that it can do stuff like pull the > device numbers, check sb->s_type->fs_flags fields (e.g > FS_REQUIRES_DEV), etc as needed to encode/decode/verify filehandles. > > Somewhere in the code this path must be pinned to the svc_export for > the life of the svc_export (svc_export_init()?), and at that point > the svc_export code could update the struct super_block state > appropriately.... > No, it doesn't quite work like that. The exports table is loaded on- demand by nfsd via an upcall to mountd. If you set up a filesystem to be exported by nfsd, but don't do any nfs activity against it, you'll still be able to unmount the filesystem because the export entry won't have been loaded by the kernel yet. Once a client talks to nfsd and touches the export, the kernel will upcall and that's when the dentry gets pinned. This is all _really_ old, crusty code, fwiw. > > > Hence it appears to me that with the new i_version specification > > > that there's an avenue out of this problem entirely that is "nfsd > > > needs to use ctime, not i_version". This solution seems generic > > > enough that filesystems with existing on-disk nanosecond timestamp > > > granularity would no longer need explicit on-disk support for the > > > nfsd i_version functionality, yes? > > > > > > > Pretty much. > > > > My understanding has always been that it's not the on-disk format that's > > the limiting factor, but the resolution of in-kernel timestamp sources. > > If ktime_get_real_ts64 has real ns granularity, then that should be > > sufficient (at least for the moment). I'm unclear on the performance > > implications with such a change though. > > It's indicated in the documentation: > > "These are only useful when called in a fast path and one still > expects better than second accuracy, but can't easily use 'jiffies', > e.g. for inode timestamps. Skipping the hardware clock access saves > around 100 CPU cycles on most modern machines with a reliable cycle > counter, but up to several microseconds on older hardware with an > external clocksource." > > For a modern, high performance machine, 100 CPU cycles for the cycle > counter read is less costly than a pipeline stall due to a single > cacheline miss. For older, slower hardware, the transaction overhead > of a ctime update is already in the order of microseconds, so this > amount of overhead still isn't a show stopper. > > As it is, optimising the timestamp read similar to the the iversion > bump only after it has been queried (i.e. equivalent of > inode_maybe_inc_iversion()) would remove most of the additional > overhead for non-NFSd operations. It could be done simply using > an inode state flag rather than hiding the state bit in the > i_version value... > > > You had also mentioned a while back that there was some desire for > > femtosecond resolution on timestamps. Does that change the calculus here > > at all? Note that the i_version is not subject to any timestamp > > granularity issues. > > [ Reference from 2016 on femptosecond granularity timestamps in > statx(): > > https://lore.kernel.org/linux-fsdevel/20161117234047.GE28177@dastard/ > > where I ask: > > "So what happens in ten years time when we want to support > femptosecond resolution in the timestamp interface?" ] > > Timestamp granularity changes will require an on-disk format change > regardless of anything else. 
We have no plans to do this again in > the near future - we've just done a revision for >y2038 timestamp > support in the on disk format, and we'd have to do another to > support sub-nanosecond timestamp granularity. Hence we know exactly > how much time, resources and testing needs to be put into changing > the on-disk timestamp format. Given that adding a new i_version > field is of similar complexity, we don't want to do either if we can > possibly avoid it. > > Looking from a slightly higher perspective, in XFS timestamp updates > are done under exclusive inode locks and so the entire transaction > will need to be done in sub-nanosecond time before we need to worry > about timestamp granularity. It's going to be a long, long time into > the future before that ever happens (if ever!), so I don't think we > need to change the on-disk timestamp granularity to disambiguate > individual inode metadata/data changes any time soon. > Fair enough. > > If you want nfsd to start using the ctime for i_version with xfs, then > > you can just turn off the SB_I_IVERSION flag. You will need to do some > > work though to keep your "special" i_version that also counts atime > > updates working once you turn that off. You'll probably want to do that > > anyway though since the semantics for xfs's version counter are > > different from everyone else's. > > XFS already uses ->update_time because it is different to other > filesystems in that pure timestamp updates are always updated > transactionally rather than just updating the inode in memory. I'm > not sure there's anything we need to change there right now. > > Other things would need to change if we don't set SB_I_IVERSION - > we'd unconditionally bump i_version in xfs_trans_log_inode() rather > than forcing it through inode_maybe_inc_iversion() because we no > longer want the VFS or applications to modify i_version. > > But we do still want to tell external parties that i_version is a > usable counter that can be used for data and metadata change > detection, and that stands regardless of the fact that the NFSd > application doesn't want fine-grained change detection anymore. > Hence I think whatever gets done needs to be more nuanced than > "SB_I_VERSION really only means NFSD_CAN_USE_I_VERSION". Perhaps > a second flag that says "SB_I_VERSION_NFSD_COMPATIBLE" to > differentiate between the two cases? With the changes I have queued up for v6.3, responsibility for i_version queries is moved into the filesystems' getattr routines. The kernel does still use IS_I_VERSION to indicate that the vfs should attempt to bump the counter, however. Once that goes in, you should be able to turn off SB_I_VERSION in xfs, and then we won't try to increment the i_version anymore in the vfs. If you want nfsd to just use the ctime for the change attribute at that point, you won't need to do anything further. If you want to have xfs provide a new one, you can fill out the change_cookie field in struct kstat. > > [ Reflection on terminology: how denigrating does it appear when > something is called "special" because it is different to others? > Such terminology would never be allowed if we were talking about > differences between people. ] > True, but XFS isn't a person. I wasn't trying to be denigrating here, just pointing out that XFS has special requirements vs. other filesystems. > > If this is what you choose to do for xfs, then the question becomes: who > > is going to do that timestamp rework? > > Depends on what ends up needing to be changed, I'm guessing... 
> > > Note that there are two other lingering issues with i_version. Neither > > of these are xfs-specific, but they may inform the changes you want to > > make there: > > > > 1/ the ctime and i_version can roll backward on a crash. > > Yup, because async transaction engines allow metadata to appear > changed in memory before it is stable on disk. No difference between > i_version or ctime here at all. > There is a little difference. The big danger here is that the i_version could roll backward after being seen by a client, and then on a subsequent change, we'd end up with a duplicate i_version that reflects a different state of the file from what the client has observed. ctime is probably more resilient here, as to get the same effect you'd need to crash and roll back, _and_ have the clock roll backward too. > > 2/ the ctime and i_version are both currently updated before write data > > is copied to the pagecache. It would be ideal if that were done > > afterward instead. (FWIW, I have some draft patches for btrfs and ext4 > > for this, but they need a lot more testing.) > > AFAICT, changing the i_version after the page cache has been written > to does not fix the visibility and/or crash ordering issue. The new > data is published for third party access when the folio the data is > written into is unlocked, not when the entire write operation > completes. Hence we can start writeback on data that is part of a > write operation before the write operation has completed and updated > i_version. > > If we then crash before the write into the page cache completes and > updates i_version, we can now have changed data on disk without a > i_version update in the metadata to even recover from the journal. > Hence with a post-write i_version update, none of the clients will > see that their caches are stale because i_version/ctime hasn't > changed despite the data on disk having changed. > > As such, I don't really see what "update i_version after page cache > write completion" actually achieves by itself. Even "update > i_version before and after write" doesn't entirely close down the > crash hole in i_version, either - it narrows it down, but it does > not remove it... > This issue is not about crash resilience. The worry we have is that a client could see a change attribute change and do a read of the data before the writer has copied the data to the pagecache. If it does that and the i_version doesn't change again, then the client will be stuck with stale data in its cache. By bumping the times and/or version after the change, we ensure that this doesn't happen. The client might briefly associate the wrong data with a particular change attr, but that situation should be short-lived. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 13+ messages in thread
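A minimal sketch of the change_cookie route described above, assuming the statx change-cookie plumbing queued for v6.3 (STATX_CHANGE_COOKIE and the new change_cookie field in struct kstat). The helper name is made up and the choice of counter is illustrative only; this is not code from any posted patch:

	/* Sketch only: a hypothetical helper called from a filesystem's ->getattr(). */
	#include <linux/fs.h>
	#include <linux/stat.h>
	#include <linux/iversion.h>

	static void fill_change_cookie_sketch(struct inode *inode, u32 request_mask,
					      struct kstat *stat)
	{
		if (!(request_mask & STATX_CHANGE_COOKIE))
			return;

		/*
		 * Report whichever counter the filesystem chooses to expose as
		 * its change attribute. inode_query_iversion() also marks the
		 * value as queried, so the next change is guaranteed to bump it.
		 */
		stat->change_cookie = inode_query_iversion(inode);
		stat->result_mask |= STATX_CHANGE_COOKIE;
	}

With SB_I_VERSION cleared and no cookie reported this way, nfsd would fall back to using ctime as the change attribute, per the behaviour described above.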
* Re: replacement i_version counter for xfs 2023-01-31 11:55 ` Jeff Layton @ 2023-01-31 23:23 ` Dave Chinner 2023-02-01 19:11 ` Jeff Layton 0 siblings, 1 reply; 13+ messages in thread From: Dave Chinner @ 2023-01-31 23:23 UTC (permalink / raw) To: Jeff Layton; +Cc: linux-xfs, linux-fsdevel On Tue, Jan 31, 2023 at 06:55:41AM -0500, Jeff Layton wrote: > On Mon, 2023-01-30 at 12:54 +1100, Dave Chinner wrote: > > On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote: > > > On Wed, 2023-01-25 at 11:02 +1100, Dave Chinner wrote: > > > > IIUC the rest of the justification for i_version is that ctime might > > > > lack the timestamp granularity to disambiguate sub-timestamp > > > > granularity changes, so i_version is needed to bridge that gap. > > > > > > > > Given that XFS has nanosecond timestamp resolution in the on-disk > > > > format, both i_version and ctime changes are journalled, and > > > > ctime/i_version will always change at exactly the same time in the > > > > same transactions, there are no inherent sub-timestamp granularity > > > > problems with ctime within XFS. Any deficiency in ctime resolution > > > > comes solely from the granularity of the VFS inode timestamp > > > > functions. > > > > > > > > And so if current_time() was to provide fine-grained nanosecond > > > > timestamp resolution for exported XFS filesystems (i.e. use > > > > ktime_get_real_ts64() conditionally), then it seems to me that the > > > > nfsd i_version function becomes completely redundant. > > > > > > > > i.e. we are pretty much guaranteed that ctime on exported > > > > filesystems will always be different for explicit modifications to > > > > the same inode, and hence we can just use ctime as the version > > > > change identifier without needing any on-disk format changes at all. > > > > > > > > And we can optimise away that overhead when the filesystem is not > > > > exported by just using the coarse timestamps because there is no > > > > need for sub-timer-tick disambiguation of single file > > > > modifications.... > > > > > > > > > > Ok, so conditional on (maybe) a per fstype flag, and whether the > > > filesystem is exported? > > > > Not sure why a per-fstype flag is necessary? > > > > I was thinking most filesystems wouldn't need these high-res timestamps, > so we could limit it to those that opt in via a fstype flag. We'd only need high-res timestamps when NFS is in use, not all the time. We already have timestamp granularity information in the superblock, so if the filesystem is exported and the sb indicates that it has sub-jiffie timestamp resolution, we can then use high-res timestamps for ctime and the need for i_version goes away completely.... > > > It's not trivial to tell whether something is exported though. We > > > typically only do that sort of checking within nfsd. That involves an > > > upcall into mountd, at a minimum. > > > > > > I don't think you want to be plumbing calls to exportfs into xfs for > > > this. It may be simpler to just add a new on-disk counter and be done > > > with it. > > > > I didn't ever expect for XFS to have to be aware of the fact that a > > user has exported the filesystem. If "filesystem has been exported" > > tracking is required, then we add a flag/counter to the superblock, > > and the NFSd subsystem updates the counter/flag when it is informed > > that part of the filesystem has been exported/unexported. > > > > The NFSd/export subsystem is pinning filesystems in memory when they > > are exported. 
This is evident by the fact we cannot unmount an > > exported filesystem - it has to be unexported before it can be > > unmounted. I suspect that it's the ex_path that is stored in the > > svc_export structure, because stuff like this is done in the > > filehandle code: > > > > static bool is_root_export(struct svc_export *exp) > > { > > return exp->ex_path.dentry == exp->ex_path.dentry->d_sb->s_root; > > } > > > > static struct super_block *exp_sb(struct svc_export *exp) > > { > > return exp->ex_path.dentry->d_sb; > > } > > > > i.e. the file handle code assumes the existence of a pinned path > > that is the root of the exported directory tree. This points to the > > superblock behind the export so that it can do stuff like pull the > > device numbers, check sb->s_type->fs_flags fields (e.g > > FS_REQUIRES_DEV), etc as needed to encode/decode/verify filehandles. > > > > Somewhere in the code this path must be pinned to the svc_export for > > the life of the svc_export (svc_export_init()?), and at that point > > the svc_export code could update the struct super_block state > > appropriately.... > > > > No, it doesn't quite work like that. The exports table is loaded on- > demand by nfsd via an upcall to mountd. > > If you set up a filesystem to be exported by nfsd, but don't do any nfs > activity against it, you'll still be able to unmount the filesystem > because the export entry won't have been loaded by the kernel yet. Once > a client talks to nfsd and touches the export, the kernel will upcall > and that's when the dentry gets pinned. I think you've missed the forest for the trees. Yes, the pinning mechanism works slightly differently to what I described, but the key take-away is that there is a *reliable mechanism* that tracks when a filesystem is effectively exported by the NFSd and high res timestamps would be required. Hence it's still trivial for those triggers to mark the sb with "export pinned" state (e.g. a simple counter) for other code to vary their algorithms based on whether the filesystem is being actively accessed by remote NFS clients or not. > > > If this is what you choose to do for xfs, then the question becomes: who > > > is going to do that timestamp rework? > > > > Depends on what ends up needing to be changed, I'm guessing... > > > > > Note that there are two other lingering issues with i_version. Neither > > > of these are xfs-specific, but they may inform the changes you want to > > > make there: > > > > > > 1/ the ctime and i_version can roll backward on a crash. > > > > Yup, because async transaction engines allow metadata to appear > > changed in memory before it is stable on disk. No difference between > > i_version or ctime here at all. > > > > There is a little difference. The big danger here is that the i_version > could roll backward after being seen by a client, and then on a > subsequent change, we'd end up with a duplicate i_version that reflects > a different state of the file from what the client has observed. > > ctime is probably more resilient here, as to get the same effect you'd > need to crash and roll back, _and_ have the clock roll backward too. > > > > 2/ the ctime and i_version are both currently updated before write data > > > is copied to the pagecache. It would be ideal if that were done > > > afterward instead. (FWIW, I have some draft patches for btrfs and ext4 > > > for this, but they need a lot more testing.) 
> > > > AFAICT, changing the i_version after the page cache has been written > > to does not fix the visibility and/or crash ordering issue. The new > > data is published for third party access when the folio the data is > > written into is unlocked, not when the entire write operation > > completes. Hence we can start writeback on data that is part of a > > write operation before the write operation has completed and updated > > i_version. > > > > If we then crash before the write into the page cache completes and > > updates i_version, we can now have changed data on disk without a > > i_version update in the metadata to even recover from the journal. > > Hence with a post-write i_version update, none of the clients will > > see that their caches are stale because i_version/ctime hasn't > > changed despite the data on disk having changed. > > > > As such, I don't really see what "update i_version after page cache > > write completion" actually achieves by itself. Even "update > > i_version before and after write" doesn't entirely close down the > > crash hole in i_version, either - it narrows it down, but it does > > not remove it... > > > > This issue is not about crash resilience. > > The worry we have is that a client could see a change attribute change > and do a read of the data before the writer has copied the data to the > pagecache. If it does that and the i_version doesn't change again, then > the client will be stuck with stale data in its cache. That can't happen with XFS. Buffered reads cannot run concurrently with buffered writes because we serialise entire write IO requests against entire read IO requests via the i_rwsem. i.e. buffered reads use shared locking, buffered writes use exclusive locking. As a result, the kiocb_modified() call to change i_version/ctime in the write path is done while we hold the i_rwsem exclusively. i.e. the version change is done while concurrent readers are blocked waiting for the write to complete the data copy into the page cache. Hence there is no transient state between the i_version/ctime update and the data copy-in completion that buffered readers can observe. Yes, this is another aspect where XFS is different to every other Linux filesystem - none of the other filesystems have this "atomic" buffered write IO path. Instead, they use folio locks to serialise page cache access and so reads can run concurrently with write operations as long as they aren't trying to access the same folio at the same time. Hence you can get concurrent overlapping reads and writes and this is the root cause of all the i_version/ctime update problems you are talking about. > By bumping the times and/or version after the change, we ensure that > this doesn't happen. The client might briefly associate the wrong data > with a particular change attr, but that situation should be short-lived. Unless the server crashes mid-write.... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 13+ messages in thread
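A heavily condensed outline of the write-side locking described above. This is not the actual xfs_file_buffered_write(); it omits xfs_file_write_checks(), IOCB_NOWAIT handling and all error/fallback paths, and assumes the usual fs/xfs headers (xfs_inode.h, xfs_iomap.h) plus linux/iomap.h:

	/* Sketch only: buffered writes take the iolock exclusive, buffered reads take it shared. */
	static ssize_t buffered_write_outline(struct kiocb *iocb, struct iov_iter *from)
	{
		struct xfs_inode	*ip = XFS_I(file_inode(iocb->ki_filp));
		ssize_t			ret;

		xfs_ilock(ip, XFS_IOLOCK_EXCL);		/* excludes XFS_IOLOCK_SHARED readers */

		ret = kiocb_modified(iocb);		/* ctime/i_version bump */
		if (!ret)
			ret = iomap_file_buffered_write(iocb, from,
							&xfs_buffered_write_iomap_ops);

		xfs_iunlock(ip, XFS_IOLOCK_EXCL);
		return ret;
	}

Because the bump and the copy into the page cache both happen under the exclusive iolock, a buffered reader cannot observe the new change attribute paired with the old data.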
* Re: replacement i_version counter for xfs 2023-01-31 23:23 ` Dave Chinner @ 2023-02-01 19:11 ` Jeff Layton 0 siblings, 0 replies; 13+ messages in thread From: Jeff Layton @ 2023-02-01 19:11 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs, linux-fsdevel On Wed, 2023-02-01 at 10:23 +1100, Dave Chinner wrote: > On Tue, Jan 31, 2023 at 06:55:41AM -0500, Jeff Layton wrote: > > On Mon, 2023-01-30 at 12:54 +1100, Dave Chinner wrote: > > > On Wed, Jan 25, 2023 at 06:47:12AM -0500, Jeff Layton wrote: > > > > On Wed, 2023-01-25 at 11:02 +1100, Dave Chinner wrote: > > > > > IIUC the rest of the justification for i_version is that ctime might > > > > > lack the timestamp granularity to disambiguate sub-timestamp > > > > > granularity changes, so i_version is needed to bridge that gap. > > > > > > > > > > Given that XFS has nanosecond timestamp resolution in the on-disk > > > > > format, both i_version and ctime changes are journalled, and > > > > > ctime/i_version will always change at exactly the same time in the > > > > > same transactions, there are no inherent sub-timestamp granularity > > > > > problems with ctime within XFS. Any deficiency in ctime resolution > > > > > comes solely from the granularity of the VFS inode timestamp > > > > > functions. > > > > > > > > > > And so if current_time() was to provide fine-grained nanosecond > > > > > timestamp resolution for exported XFS filesystems (i.e. use > > > > > ktime_get_real_ts64() conditionally), then it seems to me that the > > > > > nfsd i_version function becomes completely redundant. > > > > > > > > > > i.e. we are pretty much guaranteed that ctime on exported > > > > > filesystems will always be different for explicit modifications to > > > > > the same inode, and hence we can just use ctime as the version > > > > > change identifier without needing any on-disk format changes at all. > > > > > > > > > > And we can optimise away that overhead when the filesystem is not > > > > > exported by just using the coarse timestamps because there is no > > > > > need for sub-timer-tick disambiguation of single file > > > > > modifications.... > > > > > > > > > > > > > Ok, so conditional on (maybe) a per fstype flag, and whether the > > > > filesystem is exported? > > > > > > Not sure why a per-fstype flag is necessary? > > > > > > > I was thinking most filesystems wouldn't need these high-res timestamps, > > so we could limit it to those that opt in via a fstype flag. > > We'd only need high-res timestamps when NFS is in use, not all the > time. We already have timestamp granularity information in the > superblock, so if the filesystem is exported and the sb indicates > that it has sub-jiffie timestamp resolution, we can then use > high-res timestamps for ctime and the need for i_version goes away > completely.... > > > > > It's not trivial to tell whether something is exported though. We > > > > typically only do that sort of checking within nfsd. That involves an > > > > upcall into mountd, at a minimum. > > > > > > > > I don't think you want to be plumbing calls to exportfs into xfs for > > > > this. It may be simpler to just add a new on-disk counter and be done > > > > with it. > > > > > > I didn't ever expect for XFS to have to be aware of the fact that a > > > user has exported the filesystem. If "filesystem has been exported" > > > tracking is required, then we add a flag/counter to the superblock, > > > and the NFSd subsystem updates the counter/flag when it is informed > > > that part of the filesystem has been exported/unexported. 
> > > > > > The NFSd/export subsystem is pinning filesystems in memory when they > > > are exported. This is evident by the fact we cannot unmount an > > > exported filesystem - it has to be unexported before it can be > > > unmounted. I suspect that it's the ex_path that is stored in the > > > svc_export structure, because stuff like this is done in the > > > filehandle code: > > > > > > static bool is_root_export(struct svc_export *exp) > > > { > > > return exp->ex_path.dentry == exp->ex_path.dentry->d_sb->s_root; > > > } > > > > > > static struct super_block *exp_sb(struct svc_export *exp) > > > { > > > return exp->ex_path.dentry->d_sb; > > > } > > > > > > i.e. the file handle code assumes the existence of a pinned path > > > that is the root of the exported directory tree. This points to the > > > superblock behind the export so that it can do stuff like pull the > > > device numbers, check sb->s_type->fs_flags fields (e.g > > > FS_REQUIRES_DEV), etc as needed to encode/decode/verify filehandles. > > > > > > Somewhere in the code this path must be pinned to the svc_export for > > > the life of the svc_export (svc_export_init()?), and at that point > > > the svc_export code could update the struct super_block state > > > appropriately.... > > > > > > > No, it doesn't quite work like that. The exports table is loaded on- > > demand by nfsd via an upcall to mountd. > > > > If you set up a filesystem to be exported by nfsd, but don't do any nfs > > activity against it, you'll still be able to unmount the filesystem > > because the export entry won't have been loaded by the kernel yet. Once > > a client talks to nfsd and touches the export, the kernel will upcall > > and that's when the dentry gets pinned. > > I think you've missed the forest for the trees. Yes, the pinning > mechanism works slightly differently to what I described, but the > key take-away is that there is a *reliable mechanism* that tracks > when a filesystem is effectively exported by the NFSd and high res > timestamps would be required. > No, there really isn't, unless you're willing to upcall for that info. > Hence it's still trivial for those triggers to mark the sb with > "export pinned" state (e.g. a simple counter) for other code to vary > their algorithms based on whether the filesystem is being actively > accessed by remote NFS clients or not. > All of this happens _on_ _demand_ when there are requests to the NFS server. The kernel has no idea whether a particular subtree is exported until it upcalls and talks to mountd. If you want to base any behavior in the filesystem on whether it's exported, you need do that upcall first. Bear in mind that I say _dentry_ here and not filesystem. Exports are not necessarily done on filesystem boundaries. If we don't know whether the parent directory of a dentry is exported, then we have to look it up in the exports table, and we may have to perform an upcall, depending on what's currently cached there. IOW, basing any behavior on whether a filesystem is exported is a much more difficult proposition than what you're suggesting above. We have to do that to handle NFS RPCs, but I doubt local users would be as happy with that. > > > > If this is what you choose to do for xfs, then the question becomes: who > > > > is going to do that timestamp rework? > > > > > > Depends on what ends up needing to be changed, I'm guessing... > > > > > > > Note that there are two other lingering issues with i_version. 
Neither > > > > of these are xfs-specific, but they may inform the changes you want to > > > > make there: > > > > > > > > 1/ the ctime and i_version can roll backward on a crash. > > > > > > Yup, because async transaction engines allow metadata to appear > > > changed in memory before it is stable on disk. No difference between > > > i_version or ctime here at all. > > > > > > > There is a little difference. The big danger here is that the i_version > > could roll backward after being seen by a client, and then on a > > subsequent change, we'd end up with a duplicate i_version that reflects > > a different state of the file from what the client has observed. > > > > ctime is probably more resilient here, as to get the same effect you'd > > need to crash and roll back, _and_ have the clock roll backward too. > > > > > > 2/ the ctime and i_version are both currently updated before write data > > > > is copied to the pagecache. It would be ideal if that were done > > > > afterward instead. (FWIW, I have some draft patches for btrfs and ext4 > > > > for this, but they need a lot more testing.) > > > > > > AFAICT, changing the i_version after the page cache has been written > > > to does not fix the visibility and/or crash ordering issue. The new > > > data is published for third party access when the folio the data is > > > written into is unlocked, not when the entire write operation > > > completes. Hence we can start writeback on data that is part of a > > > write operation before the write operation has completed and updated > > > i_version. > > > > > > If we then crash before the write into the page cache completes and > > > updates i_version, we can now have changed data on disk without a > > > i_version update in the metadata to even recover from the journal. > > > Hence with a post-write i_version update, none of the clients will > > > see that their caches are stale because i_version/ctime hasn't > > > changed despite the data on disk having changed. > > > > > > As such, I don't really see what "update i_version after page cache > > > write completion" actually achieves by itself. Even "update > > > i_version before and after write" doesn't entirely close down the > > > crash hole in i_version, either - it narrows it down, but it does > > > not remove it... > > > > > > > This issue is not about crash resilience. > > > > The worry we have is that a client could see a change attribute change > > and do a read of the data before the writer has copied the data to the > > pagecache. If it does that and the i_version doesn't change again, then > > the client will be stuck with stale data in its cache. > > That can't happen with XFS. Buffered reads cannot run concurrently > with buffered writes because we serialise entire write IO requests > against entire read IO requests via the i_rwsem. i.e. buffered reads > use shared locking, buffered writes use exclusive locking. > > As a result, the kiocb_modified() call to change i_version/ctime in > the write path is done while we hold the i_rwsem exclusively. i.e. > the version change is done while concurrent readers are blocked > waiting for the write to complete the data copy into the page cache. > Hence there is no transient state between the i_version/ctime update > and the data copy-in completion that buffered readers can observe. > > Yes, this is another aspect where XFS is different to every other > Linux filesystem - none of the other filesystems have this "atomic" > buffered write IO path. 
Instead, they use folio locks to serialise > page cache access and so reads can run concurrently with write > operations as long as they aren't trying to access the same folio at > the same time. Hence you can get concurrent overlapping reads and > writes and this is the root cause of all the i_version/ctime update > problems you are talking about. > That's good to know. So we don't really need to do this second bump on xfs at all. > > By bumping the times and/or version after the change, we ensure that > > this doesn't happen. The client might briefly associate the wrong data > > with a particular change attr, but that situation should be short-lived. > > Unless the server crashes mid-write.... Right, but then we're back to problem #1 above. -- Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 13+ messages in thread
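For reference, the conditional timestamp selection proposed earlier in the thread would look roughly like the sketch below. The s_export_pins counter is purely hypothetical: as noted above, deciding when such a marker could be set is exactly the unresolved part, since exports are only resolved on demand via the mountd upcall and are not tied to filesystem boundaries.

	/*
	 * Sketch only. "s_export_pins" is a hypothetical superblock field that
	 * nfsd would maintain once it starts serving the filesystem; it does
	 * not exist today, and the thread does not settle how it would be kept
	 * up to date.
	 */
	#include <linux/fs.h>
	#include <linux/jiffies.h>
	#include <linux/timekeeping.h>

	static struct timespec64 current_time_sketch(struct inode *inode)
	{
		struct super_block *sb = inode->i_sb;
		struct timespec64 now;

		if (READ_ONCE(sb->s_export_pins) && sb->s_time_gran < TICK_NSEC)
			ktime_get_real_ts64(&now);	/* fine-grained; ~100 extra cycles */
		else
			ktime_get_coarse_real_ts64(&now); /* timer-tick resolution */

		return timestamp_truncate(now, inode);
	}

Whether that export state can be tracked without an upcall is one of the points the thread leaves unresolved.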
end of thread

Thread overview: 13+ messages
  2023-01-24 12:56 replacement i_version counter for xfs Jeff Layton
  2023-01-25  0:02 ` Dave Chinner
  2023-01-25 11:47 ` Jeff Layton
  2023-01-25 16:32 ` Darrick J. Wong
  2023-01-25 17:58 ` Jeff Layton
  2023-01-30  2:05 ` Dave Chinner
  2023-01-31 12:02 ` Jeff Layton
  2023-01-31 23:31 ` Dave Chinner
  2023-02-01 19:19 ` Jeff Layton
  2023-01-30  1:54 ` Dave Chinner
  2023-01-31 11:55 ` Jeff Layton
  2023-01-31 23:23 ` Dave Chinner
  2023-02-01 19:11 ` Jeff Layton