[QUESTION] Upgrade xfs filesystem to reflink support?

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* [QUESTION] Upgrade xfs filesystem to reflink support?
@ 2022-05-09  7:50 Amir Goldstein
  2022-05-09 18:20 ` Darrick J. Wong
  0 siblings, 1 reply; 13+ messages in thread
From: Amir Goldstein @ 2022-05-09  7:50 UTC (permalink / raw)
  To: Darrick J. Wong, Dave Chinner; +Cc: linux-xfs

Hi Darrick and Dave,

I might have asked this back when reflink was introduced, but cannot
find the question nor answer.

Is there any a priori NACK or exceptional challenges w.r.t implementing
upgrade of xfs to reflink support?

We have several customers with xfs formatted pre reflink that we would
like to consider
upgrading.

Back in the time of reflink circa v4.9 there were few xfs features
that could be
upgraded, but nowadays, there are several features that could be upgraded.

If I am not mistaken, the target audience for this upgrade would be
xfs formatted
with xfsprogs 4.17 (defaults).
I realize that journal size may have been smaller at that time (I need to check)
which may be a source of additional problems, but hopefully, some of your work
to do a diet for journal credits for reflink could perhaps mitigate
that issue(?).

Shall I take a swing at it?

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] Upgrade xfs filesystem to reflink support?
  2022-05-09  7:50 [QUESTION] Upgrade xfs filesystem to reflink support? Amir Goldstein
@ 2022-05-09 18:20 ` Darrick J. Wong
  2022-05-10  6:21   ` Amir Goldstein
  2022-05-10  9:29   ` Christoph Hellwig
  0 siblings, 2 replies; 13+ messages in thread
From: Darrick J. Wong @ 2022-05-09 18:20 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Dave Chinner, linux-xfs

[drop my oracle email from cc, outlook sux]

On Mon, May 09, 2022 at 10:50:20AM +0300, Amir Goldstein wrote:
> Hi Darrick and Dave,
> 
> I might have asked this back when reflink was introduced, but cannot
> find the question nor answer.
> 
> Is there any a priori NACK or exceptional challenges w.r.t implementing
> upgrade of xfs to reflink support?

No, just lack of immediate user demand + time to develop and merge code
+ time to QA the whole mess to make sure it doesn't introduce any
messes.

> We have several customers with xfs formatted pre reflink that we would
> like to consider
> upgrading.
> 
> Back in the time of reflink circa v4.9 there were few xfs features
> that could be
> upgraded, but nowadays, there are several features that could be upgraded.
> 
> If I am not mistaken, the target audience for this upgrade would be
> xfs formatted
> with xfsprogs 4.17 (defaults).
> I realize that journal size may have been smaller at that time (I need to check)
> which may be a source of additional problems,

Yes.  We've found in practice that logsize < 100MB produce serious
scalability problems and increase deadlock opportunities on such old
kernels.  The 64MB floor we just put in for xfsprogs 5.15 was a good
enough downwards estimate assuming that most people will end up on 5.19+
kernels in the (very) long run.

> but hopefully, some of your work
> to do a diet for journal credits for reflink could perhaps mitigate
> that issue(?).

That work reduces the internal transaction size but leaves the existing
minimum log size standards intact.

> Shall I take a swing at it?

It's already written:
https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=upgrade-older-features

I think the upcoming nrext64 xfsprogs patches took in the first patch in
that series.

Question: Now that mkfs has a min logsize of 64MB, should we refuse
upgrades for any filesystem with logsize < 64MB?

--D

> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] Upgrade xfs filesystem to reflink support?
  2022-05-09 18:20 ` Darrick J. Wong
@ 2022-05-10  6:21   ` Amir Goldstein
  2022-05-10 19:02     ` Darrick J. Wong
  2022-05-10  9:29   ` Christoph Hellwig
  1 sibling, 1 reply; 13+ messages in thread
From: Amir Goldstein @ 2022-05-10  6:21 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs, Theodore Tso

On Mon, May 9, 2022 at 9:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> [drop my oracle email from cc, outlook sux]
>
> On Mon, May 09, 2022 at 10:50:20AM +0300, Amir Goldstein wrote:
> > Hi Darrick and Dave,
> >
> > I might have asked this back when reflink was introduced, but cannot
> > find the question nor answer.
> >
> > Is there any a priori NACK or exceptional challenges w.r.t implementing
> > upgrade of xfs to reflink support?
>
> No, just lack of immediate user demand + time to develop and merge code
> + time to QA the whole mess to make sure it doesn't introduce any
> messes.
>

I can certainly help with QA the upgrade scenarios.

> > We have several customers with xfs formatted pre reflink that we would
> > like to consider
> > upgrading.
> >
> > Back in the time of reflink circa v4.9 there were few xfs features
> > that could be
> > upgraded, but nowadays, there are several features that could be upgraded.
> >
> > If I am not mistaken, the target audience for this upgrade would be
> > xfs formatted
> > with xfsprogs 4.17 (defaults).
> > I realize that journal size may have been smaller at that time (I need to check)
> > which may be a source of additional problems,
>
> Yes.  We've found in practice that logsize < 100MB produce serious
> scalability problems and increase deadlock opportunities on such old
> kernels.  The 64MB floor we just put in for xfsprogs 5.15 was a good
> enough downwards estimate assuming that most people will end up on 5.19+
> kernels in the (very) long run.
>
> > but hopefully, some of your work
> > to do a diet for journal credits for reflink could perhaps mitigate
> > that issue(?).
>
> That work reduces the internal transaction size but leaves the existing
> minimum log size standards intact.
>
> > Shall I take a swing at it?
>
> It's already written:
> https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=upgrade-older-features
>

How convenient :)
I can start testing already.

> I think the upcoming nrext64 xfsprogs patches took in the first patch in
> that series.
>
> Question: Now that mkfs has a min logsize of 64MB, should we refuse
> upgrades for any filesystem with logsize < 64MB?

I think that would make a lot of sense. We do need to reduce the upgrade
test matrix as much as we can, at least as a starting point.
Our customers would have started with at least 1TB fs, so should not
have a problem with minimum logsize on upgrade.

BTW, in LSFMM, Ted had a session about "Resize patterns" regarding the
practice of users to start with a small fs and grow it, which is encouraged by
Cloud providers pricing model.

I had asked Ted about the option to resize the ext4 journal and he replied
that in theory it could be done, because the ext4 journal does not need to be
contiguous. He thought that it was not the case for XFS though.

Thanks,
Amir.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] Upgrade xfs filesystem to reflink support?
  2022-05-10  6:21   ` Amir Goldstein
@ 2022-05-10 19:02     ` Darrick J. Wong
  2022-05-10 22:05       ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Darrick J. Wong @ 2022-05-10 19:02 UTC (permalink / raw)
  To: Amir Goldstein; +Cc: Dave Chinner, linux-xfs, Theodore Tso

On Tue, May 10, 2022 at 09:21:03AM +0300, Amir Goldstein wrote:
> On Mon, May 9, 2022 at 9:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > [drop my oracle email from cc, outlook sux]
> >
> > On Mon, May 09, 2022 at 10:50:20AM +0300, Amir Goldstein wrote:
> > > Hi Darrick and Dave,
> > >
> > > I might have asked this back when reflink was introduced, but cannot
> > > find the question nor answer.
> > >
> > > Is there any a priori NACK or exceptional challenges w.r.t implementing
> > > upgrade of xfs to reflink support?
> >
> > No, just lack of immediate user demand + time to develop and merge code
> > + time to QA the whole mess to make sure it doesn't introduce any
> > messes.
> >
> 
> I can certainly help with QA the upgrade scenarios.
> 
> > > We have several customers with xfs formatted pre reflink that we would
> > > like to consider
> > > upgrading.
> > >
> > > Back in the time of reflink circa v4.9 there were few xfs features
> > > that could be
> > > upgraded, but nowadays, there are several features that could be upgraded.
> > >
> > > If I am not mistaken, the target audience for this upgrade would be
> > > xfs formatted
> > > with xfsprogs 4.17 (defaults).
> > > I realize that journal size may have been smaller at that time (I need to check)
> > > which may be a source of additional problems,
> >
> > Yes.  We've found in practice that logsize < 100MB produce serious
> > scalability problems and increase deadlock opportunities on such old
> > kernels.  The 64MB floor we just put in for xfsprogs 5.15 was a good
> > enough downwards estimate assuming that most people will end up on 5.19+
> > kernels in the (very) long run.
> >
> > > but hopefully, some of your work
> > > to do a diet for journal credits for reflink could perhaps mitigate
> > > that issue(?).
> >
> > That work reduces the internal transaction size but leaves the existing
> > minimum log size standards intact.
> >
> > > Shall I take a swing at it?
> >
> > It's already written:
> > https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=upgrade-older-features
> >
> 
> How convenient :)
> I can start testing already.
> 
> > I think the upcoming nrext64 xfsprogs patches took in the first patch in
> > that series.
> >
> > Question: Now that mkfs has a min logsize of 64MB, should we refuse
> > upgrades for any filesystem with logsize < 64MB?
> 
> I think that would make a lot of sense. We do need to reduce the upgrade
> test matrix as much as we can, at least as a starting point.
> Our customers would have started with at least 1TB fs, so should not
> have a problem with minimum logsize on upgrade.
> 
> BTW, in LSFMM, Ted had a session about "Resize patterns" regarding the
> practice of users to start with a small fs and grow it, which is encouraged by
> Cloud providers pricing model.
> 
> I had asked Ted about the option to resize the ext4 journal and he replied
> that in theory it could be done, because the ext4 journal does not need to be
> contiguous. He thought that it was not the case for XFS though.

It's theoretically possible, but I'd bet that making it work reliably
will be difficult for an infrequent operation.  The old log would probably
have to clean itself, and then write a single transaction containing
both the bnobt update to allocate the new log as well as an EFI to erase
it.  Then you write to the new log a single transaction containing the
superblock and an EFI to free the old log.  Then you update the primary
super and force it out to disk, un-quiesce the log, and finish that EFI
so that the old log gets freed.

And then you have to go back and find the necessary parts that I missed.

--D

> Thanks,
> Amir.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] Upgrade xfs filesystem to reflink support?
  2022-05-10 19:02     ` Darrick J. Wong
@ 2022-05-10 22:05       ` Dave Chinner
  2022-05-11 15:46         ` Brian Foster
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2022-05-10 22:05 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Amir Goldstein, linux-xfs, Theodore Tso

On Tue, May 10, 2022 at 12:02:12PM -0700, Darrick J. Wong wrote:
> On Tue, May 10, 2022 at 09:21:03AM +0300, Amir Goldstein wrote:
> > On Mon, May 9, 2022 at 9:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > I think the upcoming nrext64 xfsprogs patches took in the first patch in
> > > that series.
> > >
> > > Question: Now that mkfs has a min logsize of 64MB, should we refuse
> > > upgrades for any filesystem with logsize < 64MB?
> > 
> > I think that would make a lot of sense. We do need to reduce the upgrade
> > test matrix as much as we can, at least as a starting point.
> > Our customers would have started with at least 1TB fs, so should not
> > have a problem with minimum logsize on upgrade.
> > 
> > BTW, in LSFMM, Ted had a session about "Resize patterns" regarding the
> > practice of users to start with a small fs and grow it, which is encouraged by
> > Cloud providers pricing model.
> > 
> > I had asked Ted about the option to resize the ext4 journal and he replied
> > that in theory it could be done, because the ext4 journal does not need to be
> > contiguous. He thought that it was not the case for XFS though.
> 
> It's theoretically possible, but I'd bet that making it work reliably
> will be difficult for an infrequent operation.  The old log would probably
> have to clean itself, and then write a single transaction containing
> both the bnobt update to allocate the new log as well as an EFI to erase
> it.  Then you write to the new log a single transaction containing the
> superblock and an EFI to free the old log.  Then you update the primary
> super and force it out to disk, un-quiesce the log, and finish that EFI
> so that the old log gets freed.
> 
> And then you have to go back and find the necessary parts that I missed.

The new log transaction to say "the new log is over there" so log
recovery knows that the old log is being replaced and can go find
the new log and recover it to free the old log.

IOWs, there's a heap of log recovery work needed, a new
intent/transaction type, futzing with feature bits because old
kernels won't be able to recovery such a operation, etc.

Then there's interesting issues that haven't ever been considered,
like having a discontiguity in the LSN as we physically switch logs.
What cycle number does the new log start at? What happens to all the
head and tail tracking fields when we switch to the new log? What
about all the log items in the AIL which is ordered by LSN? What
about all the active log items that track a specific LSN for
recovery integrity purposes (e.g. inode allocation buffers)? What
about updating the reservation grant heads that track log space
usage? Updating all the static size calculations used by the log
code which has to be done before the new log can be written to via
iclogs.

The allocation of the new log extent and the freeing of the old log
extent is the easy bit. Handling the failure cases to provide an
atomic, always recoverable switch and managing all the runtime state
and accounting changes that are necessary is the hard part...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] Upgrade xfs filesystem to reflink support?
  2022-05-10 22:05       ` Dave Chinner
@ 2022-05-11 15:46         ` Brian Foster
  2022-05-11 22:24           ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Brian Foster @ 2022-05-11 15:46 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs, Theodore Tso

On Wed, May 11, 2022 at 08:05:23AM +1000, Dave Chinner wrote:
> On Tue, May 10, 2022 at 12:02:12PM -0700, Darrick J. Wong wrote:
> > On Tue, May 10, 2022 at 09:21:03AM +0300, Amir Goldstein wrote:
> > > On Mon, May 9, 2022 at 9:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > I think the upcoming nrext64 xfsprogs patches took in the first patch in
> > > > that series.
> > > >
> > > > Question: Now that mkfs has a min logsize of 64MB, should we refuse
> > > > upgrades for any filesystem with logsize < 64MB?
> > > 
> > > I think that would make a lot of sense. We do need to reduce the upgrade
> > > test matrix as much as we can, at least as a starting point.
> > > Our customers would have started with at least 1TB fs, so should not
> > > have a problem with minimum logsize on upgrade.
> > > 
> > > BTW, in LSFMM, Ted had a session about "Resize patterns" regarding the
> > > practice of users to start with a small fs and grow it, which is encouraged by
> > > Cloud providers pricing model.
> > > 
> > > I had asked Ted about the option to resize the ext4 journal and he replied
> > > that in theory it could be done, because the ext4 journal does not need to be
> > > contiguous. He thought that it was not the case for XFS though.
> > 
> > It's theoretically possible, but I'd bet that making it work reliably
> > will be difficult for an infrequent operation.  The old log would probably
> > have to clean itself, and then write a single transaction containing
> > both the bnobt update to allocate the new log as well as an EFI to erase
> > it.  Then you write to the new log a single transaction containing the
> > superblock and an EFI to free the old log.  Then you update the primary
> > super and force it out to disk, un-quiesce the log, and finish that EFI
> > so that the old log gets freed.
> > 
> > And then you have to go back and find the necessary parts that I missed.
> 
> The new log transaction to say "the new log is over there" so log
> recovery knows that the old log is being replaced and can go find
> the new log and recover it to free the old log.
> 
> IOWs, there's a heap of log recovery work needed, a new
> intent/transaction type, futzing with feature bits because old
> kernels won't be able to recovery such a operation, etc.
> 
> Then there's interesting issues that haven't ever been considered,
> like having a discontiguity in the LSN as we physically switch logs.
> What cycle number does the new log start at? What happens to all the
> head and tail tracking fields when we switch to the new log? What
> about all the log items in the AIL which is ordered by LSN? What
> about all the active log items that track a specific LSN for
> recovery integrity purposes (e.g. inode allocation buffers)? What
> about updating the reservation grant heads that track log space
> usage? Updating all the static size calculations used by the log
> code which has to be done before the new log can be written to via
> iclogs.
> 

If XFS were going to support an online switchover of the physical log,
why not do so across a quiesce? To try and do such a thing with active
records, log items, etc. that are unrelated to the operation seems
unnecessarily complex to me.

> The allocation of the new log extent and the freeing of the old log
> extent is the easy bit. Handling the failure cases to provide an
> atomic, always recoverable switch and managing all the runtime state
> and accounting changes that are necessary is the hard part...
> 

That suggests the "hard part" of the problem is primarily the online
switchover, but is that necessarily a strict requirement for a
reasonably useful/viable feature? ISTM that even being able to increase
the size of the log offline could be quite helpful for a filesystem that
has been grown via the cloudy very small -> very large antipattern. It
not only provides a recovery path for regular end-users, but at least
gives the cloudy dev guys a step to run during image deployment to avoid
the problem.

TBH, if one were to go through the trouble of making the log resizeable,
I start to wonder whether it's worth starting with a format change that
better accommodates future flexibility. For example, the internal log is
already AG allocated space.. why not do something like assign it to an
internal log inode attached to the sb? Then the log inode has the
obvious capability to allocate or free (non-active log) extents at
runtime through all the usual codepaths without disruption because the
log itself only cares about a target device, block offset and size. We
already know a bump of the log cycle count is sufficient for consistency
across a clean mount cycle because repair has been zapping clean logs by
default as such since pretty much forever.

That potentially reduces log reallocation to a switchover algorithm that
could run at mount time. I.e., a new prospective log extent is allocated
at runtime (and maybe flagged with an xattr or something). The next
mount identifies a new/prospective log, requires/verifies that the old
log is clean, selects the new log extent (based on some currently
undefined selection algorithm) and seeds it with the appropriate cycle
count via synchronous transactions that release any currently inactive
extent(s) from the log inode. Any failure along the way sticks with the
old log and releases the still inactive new extent, if it happens to
exist. We already do this sort of stale resource clean up for other
things like unlinked inodes and stale COW blocks, so the general premise
exists.. hm?

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] Upgrade xfs filesystem to reflink support?
  2022-05-11 15:46         ` Brian Foster
@ 2022-05-11 22:24           ` Dave Chinner
  2022-05-16 13:35             ` Brian Foster
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2022-05-11 22:24 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs, Theodore Tso

On Wed, May 11, 2022 at 11:46:23AM -0400, Brian Foster wrote:
> On Wed, May 11, 2022 at 08:05:23AM +1000, Dave Chinner wrote:
> > On Tue, May 10, 2022 at 12:02:12PM -0700, Darrick J. Wong wrote:
> > > On Tue, May 10, 2022 at 09:21:03AM +0300, Amir Goldstein wrote:
> > > > On Mon, May 9, 2022 at 9:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > I think the upcoming nrext64 xfsprogs patches took in the first patch in
> > > > > that series.
> > > > >
> > > > > Question: Now that mkfs has a min logsize of 64MB, should we refuse
> > > > > upgrades for any filesystem with logsize < 64MB?
> > > > 
> > > > I think that would make a lot of sense. We do need to reduce the upgrade
> > > > test matrix as much as we can, at least as a starting point.
> > > > Our customers would have started with at least 1TB fs, so should not
> > > > have a problem with minimum logsize on upgrade.
> > > > 
> > > > BTW, in LSFMM, Ted had a session about "Resize patterns" regarding the
> > > > practice of users to start with a small fs and grow it, which is encouraged by
> > > > Cloud providers pricing model.
> > > > 
> > > > I had asked Ted about the option to resize the ext4 journal and he replied
> > > > that in theory it could be done, because the ext4 journal does not need to be
> > > > contiguous. He thought that it was not the case for XFS though.
> > > 
> > > It's theoretically possible, but I'd bet that making it work reliably
> > > will be difficult for an infrequent operation.  The old log would probably
> > > have to clean itself, and then write a single transaction containing
> > > both the bnobt update to allocate the new log as well as an EFI to erase
> > > it.  Then you write to the new log a single transaction containing the
> > > superblock and an EFI to free the old log.  Then you update the primary
> > > super and force it out to disk, un-quiesce the log, and finish that EFI
> > > so that the old log gets freed.
> > > 
> > > And then you have to go back and find the necessary parts that I missed.
> > 
> > The new log transaction to say "the new log is over there" so log
> > recovery knows that the old log is being replaced and can go find
> > the new log and recover it to free the old log.
> > 
> > IOWs, there's a heap of log recovery work needed, a new
> > intent/transaction type, futzing with feature bits because old
> > kernels won't be able to recovery such a operation, etc.
> > 
> > Then there's interesting issues that haven't ever been considered,
> > like having a discontiguity in the LSN as we physically switch logs.
> > What cycle number does the new log start at? What happens to all the
> > head and tail tracking fields when we switch to the new log? What
> > about all the log items in the AIL which is ordered by LSN? What
> > about all the active log items that track a specific LSN for
> > recovery integrity purposes (e.g. inode allocation buffers)? What
> > about updating the reservation grant heads that track log space
> > usage? Updating all the static size calculations used by the log
> > code which has to be done before the new log can be written to via
> > iclogs.
> > 
> 
> If XFS were going to support an online switchover of the physical log,
> why not do so across a quiesce? To try and do such a thing with active
> records, log items, etc. that are unrelated to the operation seems
> unnecessarily complex to me.

queisce state doesn't solve those problems. The old log has to have
committed active items in it when we switch (e.g. intent items for
the location of the new log, alloc records for the new log, etc) and
the new log has to have active items after the switch (EFI for the
old log, etc).  Just because the log is empty when we start the log
switch it doesn't mean it remains empty as we do the switchover
work.

If we want to avoid active items while writing stuff to the log,
then we going to need a custom path building lv chains to pass to
xlog_write(), like how we write unmount records.

i.e. what we probably need is a pair of custom log record that isn't
an transaction or unmount record that contains all the information
for telling the log where the new log is, and another at the start
of the new log that records where the old log was.

That way we only need to care about LSN, head/tail, start
offset, length, any changes to transaction reservations for
different physical log characteristics, etc

> TBH, if one were to go through the trouble of making the log resizeable,
> I start to wonder whether it's worth starting with a format change that
> better accommodates future flexibility. For example, the internal log is
> already AG allocated space.. why not do something like assign it to an
> internal log inode attached to the sb?  Then the log inode has the
> obvious capability to allocate or free (non-active log) extents at
> runtime through all the usual codepaths without disruption because the
> log itself only cares about a target device, block offset and size. We
> already know a bump of the log cycle count is sufficient for consistency
> across a clean mount cycle because repair has been zapping clean logs by
> default as such since pretty much forever.

Putting the log in an inode isn't needed for allocate/free of raw
extents - we already do that with grow/shrink for the tail of an AG.
Hence I'm not sure what virtually mapping the log actually gains us
over just a raw extent pointed to by the superblock?

> That potentially reduces log reallocation to a switchover algorithm that
> could run at mount time. I.e., a new prospective log extent is allocated
> at runtime (and maybe flagged with an xattr or something). The next
> mount identifies a new/prospective log, requires/verifies that the old
> log is clean, selects the new log extent (based on some currently
> undefined selection algorithm) and seeds it with the appropriate cycle
> count via synchronous transactions that release any currently inactive
> extent(s) from the log inode. Any failure along the way sticks with the
> old log and releases the still inactive new extent, if it happens to
> exist. We already do this sort of stale resource clean up for other
> things like unlinked inodes and stale COW blocks, so the general premise
> exists.. hm?

That seems like just as much special case infrastructure as a custom
log record type to run it online, and overall the algorithm isn't
much different. And the online change doesn't require an on-disk
format change...

Cheers,

Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] Upgrade xfs filesystem to reflink support?
  2022-05-11 22:24           ` Dave Chinner
@ 2022-05-16 13:35             ` Brian Foster
  2022-05-17  9:30               ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Brian Foster @ 2022-05-16 13:35 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs, Theodore Tso

On Thu, May 12, 2022 at 08:24:01AM +1000, Dave Chinner wrote:
> On Wed, May 11, 2022 at 11:46:23AM -0400, Brian Foster wrote:
> > On Wed, May 11, 2022 at 08:05:23AM +1000, Dave Chinner wrote:
> > > On Tue, May 10, 2022 at 12:02:12PM -0700, Darrick J. Wong wrote:
> > > > On Tue, May 10, 2022 at 09:21:03AM +0300, Amir Goldstein wrote:
> > > > > On Mon, May 9, 2022 at 9:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > I think the upcoming nrext64 xfsprogs patches took in the first patch in
> > > > > > that series.
> > > > > >
> > > > > > Question: Now that mkfs has a min logsize of 64MB, should we refuse
> > > > > > upgrades for any filesystem with logsize < 64MB?
> > > > > 
> > > > > I think that would make a lot of sense. We do need to reduce the upgrade
> > > > > test matrix as much as we can, at least as a starting point.
> > > > > Our customers would have started with at least 1TB fs, so should not
> > > > > have a problem with minimum logsize on upgrade.
> > > > > 
> > > > > BTW, in LSFMM, Ted had a session about "Resize patterns" regarding the
> > > > > practice of users to start with a small fs and grow it, which is encouraged by
> > > > > Cloud providers pricing model.
> > > > > 
> > > > > I had asked Ted about the option to resize the ext4 journal and he replied
> > > > > that in theory it could be done, because the ext4 journal does not need to be
> > > > > contiguous. He thought that it was not the case for XFS though.
> > > > 
> > > > It's theoretically possible, but I'd bet that making it work reliably
> > > > will be difficult for an infrequent operation.  The old log would probably
> > > > have to clean itself, and then write a single transaction containing
> > > > both the bnobt update to allocate the new log as well as an EFI to erase
> > > > it.  Then you write to the new log a single transaction containing the
> > > > superblock and an EFI to free the old log.  Then you update the primary
> > > > super and force it out to disk, un-quiesce the log, and finish that EFI
> > > > so that the old log gets freed.
> > > > 
> > > > And then you have to go back and find the necessary parts that I missed.
> > > 
> > > The new log transaction to say "the new log is over there" so log
> > > recovery knows that the old log is being replaced and can go find
> > > the new log and recover it to free the old log.
> > > 
> > > IOWs, there's a heap of log recovery work needed, a new
> > > intent/transaction type, futzing with feature bits because old
> > > kernels won't be able to recovery such a operation, etc.
> > > 
> > > Then there's interesting issues that haven't ever been considered,
> > > like having a discontiguity in the LSN as we physically switch logs.
> > > What cycle number does the new log start at? What happens to all the
> > > head and tail tracking fields when we switch to the new log? What
> > > about all the log items in the AIL which is ordered by LSN? What
> > > about all the active log items that track a specific LSN for
> > > recovery integrity purposes (e.g. inode allocation buffers)? What
> > > about updating the reservation grant heads that track log space
> > > usage? Updating all the static size calculations used by the log
> > > code which has to be done before the new log can be written to via
> > > iclogs.
> > > 
...
> 
> > TBH, if one were to go through the trouble of making the log resizeable,
> > I start to wonder whether it's worth starting with a format change that
> > better accommodates future flexibility. For example, the internal log is
> > already AG allocated space.. why not do something like assign it to an
> > internal log inode attached to the sb?  Then the log inode has the
> > obvious capability to allocate or free (non-active log) extents at
> > runtime through all the usual codepaths without disruption because the
> > log itself only cares about a target device, block offset and size. We
> > already know a bump of the log cycle count is sufficient for consistency
> > across a clean mount cycle because repair has been zapping clean logs by
> > default as such since pretty much forever.
> 
> Putting the log in an inode isn't needed for allocate/free of raw
> extents - we already do that with grow/shrink for the tail of an AG.
> Hence I'm not sure what virtually mapping the log actually gains us
> over just a raw extent pointed to by the superblock?
> 

I'm not sure what virtually mapping the log has to do with anything..?
That seems like something to consider for the future if there's value to
it..

> > That potentially reduces log reallocation to a switchover algorithm that
> > could run at mount time. I.e., a new prospective log extent is allocated
> > at runtime (and maybe flagged with an xattr or something). The next
> > mount identifies a new/prospective log, requires/verifies that the old
> > log is clean, selects the new log extent (based on some currently
> > undefined selection algorithm) and seeds it with the appropriate cycle
> > count via synchronous transactions that release any currently inactive
> > extent(s) from the log inode. Any failure along the way sticks with the
> > old log and releases the still inactive new extent, if it happens to
> > exist. We already do this sort of stale resource clean up for other
> > things like unlinked inodes and stale COW blocks, so the general premise
> > exists.. hm?
> 
> That seems like just as much special case infrastructure as a custom
> log record type to run it online, and overall the algorithm isn't
> much different. And the online change doesn't require an on-disk
> format change...
> 

It seems a lot more simple to me, assuming it's viable, but then again I
don't see a clear enough description of what you're comparing against to
reason about it with certainty. I also don't think it necessarily
depends on a format change. I.e., perhaps an online variant could look
something like the following (pseudo algorithm):

1. Userspace growfs feeds in a tmpfile or some such donor inode via
ioctl() that maps the new/prospective log as a data extent. Assume the
extent is ideally sized/allocated and disk content is properly formatted
as a valid log with an elevated cycle count for the purpose of this
sequence.

2. Quiesce the log, freeze the fs, xfs_log_unmount() the active log and
xfs_log_mount() the new range based on the extent mapped by the donor
inode.

3. Commit changes to the new log range to unmap the current extent (i.e.
the new log) from the donor inode and map the old log range, keeping the
in-core inode pinned/locked so it cannot be written back. At this point
the only changes on disk are to the donor's data extent (still mapped by
the inode). The superblock has not been updated, so any crash/recovery
will see the original log in the quiesced state, recover as normal and
process the donor file as an unlinked inode.

(The filesystem at this point has two recoverable logs. The original log
is unmodified since the quiesce and still referenced by the superblock.
It can be recovered in the event of a log switch failure. The new log
still resides in an inode extent mapping, but contains valid log records
to exchange its own extent mapping with the old log range).

4. Sync update the superblock to point at the new log range and unpin
the donor inode. A crash after this point sees the new log in the sb and
recovers the changes to map the old log into the donor inode. The old
log range is returned to free space upon unlinked inode processing of
the donor inode.

I'm not totally sure that works without prototyping some basic form of
it, but it seems generally sane to me. There are a lot of implementation
details to work through as well of course, such as whether the kernel
allocates the prospective log extent before the quiesce (and seeds it),
whether to support the ability to allocate the new in-core log subsystem
before tearing down the old (for more reliable failure handling and so
it can be activated atomically), how to format changes into the new log,
etc. etc., but that's far more detail than this is intended to cover...

Brian

> Cheers,
> 
> Dave.
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] Upgrade xfs filesystem to reflink support?
  2022-05-16 13:35             ` Brian Foster
@ 2022-05-17  9:30               ` Dave Chinner
  2022-05-19 16:07                 ` Brian Foster
  0 siblings, 1 reply; 13+ messages in thread
From: Dave Chinner @ 2022-05-17  9:30 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs, Theodore Tso

On Mon, May 16, 2022 at 09:35:14AM -0400, Brian Foster wrote:
> On Thu, May 12, 2022 at 08:24:01AM +1000, Dave Chinner wrote:
> > On Wed, May 11, 2022 at 11:46:23AM -0400, Brian Foster wrote:
> > > On Wed, May 11, 2022 at 08:05:23AM +1000, Dave Chinner wrote:
> > > > On Tue, May 10, 2022 at 12:02:12PM -0700, Darrick J. Wong wrote:
> > > > > On Tue, May 10, 2022 at 09:21:03AM +0300, Amir Goldstein wrote:
> > > > > > On Mon, May 9, 2022 at 9:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > I think the upcoming nrext64 xfsprogs patches took in the first patch in
> > > > > > > that series.
> > > > > > >
> > > > > > > Question: Now that mkfs has a min logsize of 64MB, should we refuse
> > > > > > > upgrades for any filesystem with logsize < 64MB?
> > > > > > 
> > > > > > I think that would make a lot of sense. We do need to reduce the upgrade
> > > > > > test matrix as much as we can, at least as a starting point.
> > > > > > Our customers would have started with at least 1TB fs, so should not
> > > > > > have a problem with minimum logsize on upgrade.
> > > > > > 
> > > > > > BTW, in LSFMM, Ted had a session about "Resize patterns" regarding the
> > > > > > practice of users to start with a small fs and grow it, which is encouraged by
> > > > > > Cloud providers pricing model.
> > > > > > 
> > > > > > I had asked Ted about the option to resize the ext4 journal and he replied
> > > > > > that in theory it could be done, because the ext4 journal does not need to be
> > > > > > contiguous. He thought that it was not the case for XFS though.
> > > > > 
> > > > > It's theoretically possible, but I'd bet that making it work reliably
> > > > > will be difficult for an infrequent operation.  The old log would probably
> > > > > have to clean itself, and then write a single transaction containing
> > > > > both the bnobt update to allocate the new log as well as an EFI to erase
> > > > > it.  Then you write to the new log a single transaction containing the
> > > > > superblock and an EFI to free the old log.  Then you update the primary
> > > > > super and force it out to disk, un-quiesce the log, and finish that EFI
> > > > > so that the old log gets freed.
> > > > > 
> > > > > And then you have to go back and find the necessary parts that I missed.
> > > > 
> > > > The new log transaction to say "the new log is over there" so log
> > > > recovery knows that the old log is being replaced and can go find
> > > > the new log and recover it to free the old log.
> > > > 
> > > > IOWs, there's a heap of log recovery work needed, a new
> > > > intent/transaction type, futzing with feature bits because old
> > > > kernels won't be able to recovery such a operation, etc.
> > > > 
> > > > Then there's interesting issues that haven't ever been considered,
> > > > like having a discontiguity in the LSN as we physically switch logs.
> > > > What cycle number does the new log start at? What happens to all the
> > > > head and tail tracking fields when we switch to the new log? What
> > > > about all the log items in the AIL which is ordered by LSN? What
> > > > about all the active log items that track a specific LSN for
> > > > recovery integrity purposes (e.g. inode allocation buffers)? What
> > > > about updating the reservation grant heads that track log space
> > > > usage? Updating all the static size calculations used by the log
> > > > code which has to be done before the new log can be written to via
> > > > iclogs.
> > > > 
> ...
> > 
> > > TBH, if one were to go through the trouble of making the log resizeable,
> > > I start to wonder whether it's worth starting with a format change that
> > > better accommodates future flexibility. For example, the internal log is
> > > already AG allocated space.. why not do something like assign it to an
> > > internal log inode attached to the sb?  Then the log inode has the
> > > obvious capability to allocate or free (non-active log) extents at
> > > runtime through all the usual codepaths without disruption because the
> > > log itself only cares about a target device, block offset and size. We
> > > already know a bump of the log cycle count is sufficient for consistency
> > > across a clean mount cycle because repair has been zapping clean logs by
> > > default as such since pretty much forever.
> > 
> > Putting the log in an inode isn't needed for allocate/free of raw
> > extents - we already do that with grow/shrink for the tail of an AG.
> > Hence I'm not sure what virtually mapping the log actually gains us
> > over just a raw extent pointed to by the superblock?
> > 
> 
> I'm not sure what virtually mapping the log has to do with anything..?

That's what adding the inode BMBT indirection to point at the log
blocks is - it's virtually mapping the physical block location.

> That seems like something to consider for the future if there's value to
> it..

Virtual mapping as in "memory mapping the log" is a different
problem altogether. We've talked about that several times in the
context of DAX to avoid at least one level of memcpy() that the log
writes currently do on PMEM systems.

> > > That potentially reduces log reallocation to a switchover algorithm that
> > > could run at mount time. I.e., a new prospective log extent is allocated
> > > at runtime (and maybe flagged with an xattr or something). The next
> > > mount identifies a new/prospective log, requires/verifies that the old
> > > log is clean, selects the new log extent (based on some currently
> > > undefined selection algorithm) and seeds it with the appropriate cycle
> > > count via synchronous transactions that release any currently inactive
> > > extent(s) from the log inode. Any failure along the way sticks with the
> > > old log and releases the still inactive new extent, if it happens to
> > > exist. We already do this sort of stale resource clean up for other
> > > things like unlinked inodes and stale COW blocks, so the general premise
> > > exists.. hm?
> > 
> > That seems like just as much special case infrastructure as a custom
> > log record type to run it online, and overall the algorithm isn't
> > much different. And the online change doesn't require an on-disk
> > format change...
> > 
> 
> It seems a lot more simple to me, assuming it's viable, but then again I
> don't see a clear enough description of what you're comparing against to
> reason about it with certainty. I also don't think it necessarily
> depends on a format change. I.e., perhaps an online variant could look
> something like the following (pseudo algorithm):
> 
> 1. Userspace growfs feeds in a tmpfile or some such donor inode via
> ioctl() that maps the new/prospective log as a data extent. Assume the
> extent is ideally sized/allocated and disk content is properly formatted
> as a valid log with an elevated cycle count for the purpose of this
> sequence.

It would have to be a single contiguous extent and correctly
aligned to sunit/swidth as the original log would have been.
Otherwise stuff like log stripe unit IO optimisations get broken.

> 2. Quiesce the log, freeze the fs, xfs_log_unmount() the active log and
> xfs_log_mount() the new range based on the extent mapped by the donor
> inode.
> 
> 3. Commit changes to the new log range to unmap the current extent (i.e.
> the new log) from the donor inode and map the old log range, keeping the
> in-core inode pinned/locked so it cannot be written back.

So a specialised swap extent operation? Currently we can swap
individual extents between inodes (the rmap extent swap code), but I
don't think we can't swap an inode extent with a bare, unreferenced
"free space" extent right now.

> At this point
> the only changes on disk are to the donor's data extent (still mapped by
> the inode).

And potentially rmap btree updates because the type and owner of
both extents change. There's also the need to mark the old log
extent as unwritten, because if we crash iand recovery doesn't free
the donor inode, we don't want to risk exposing the old contents of
the log to userspace.

Also, Because rmap updates are deferred, the changes will span
multiple transactions via intents, so this will make use of the CIL
and, potentially, commit to the log and insert items into the AIL.

If we get items in the AIL, we are open to metadata writeback
starting while we are still doing the log changeover.  e.g. memory
pressure occurs, direct reclaim causes xfs_reclaim_inodes_nr() to
run which calls xfs_ail_push_all(mp->m_ail)....

Hence if we end up with new metadata changes in place, we can't go
back to the old log - we've got changes that are only sequenced in
the new log, so we can only recover at this point from the new log.

Yes, we can fix all these things, but we'll end up touching lots of
high level code in subtle ways to make this all work, and I can't
help but think this will end up being a bit fragile as a result.

> The superblock has not been updated, so any crash/recovery
> will see the original log in the quiesced state, recover as normal and
> process the donor file as an unlinked inode.
> 
> (The filesystem at this point has two recoverable logs. The original log
> is unmodified since the quiesce and still referenced by the superblock.
> It can be recovered in the event of a log switch failure. The new log
> still resides in an inode extent mapping, but contains valid log records
> to exchange its own extent mapping with the old log range).

Noted, but as per above, the potential for metadata writeback to
occur before we've switch logs means we can't just go back to the
old log. hence to make sure this works, we'll need new interlocks to
ensure operations like metadata writeback can't occur while the log
swap transactions are being performed.

> 4. Sync update the superblock to point at the new log range and unpin
> the donor inode.
> A crash after this point sees the new log in the sb and
> recovers the changes to map the old log into the donor inode. The old
> log range is returned to free space upon unlinked inode processing of
> the donor inode.

Yeah, the moment we write the superblock, we have to be prepared to
recover from the new log. That's why I think that both the old
log and the new log should just have an identical "log change"
record written to them after the unmount record. That simply records
the location/size of both the old log and the new log and we can
recover from that no matter which side of the superblock write we
crash on.

i.e. we have the same change record information in both logs and the
recovery action we need to perform only differs by the old log
having to first:

- update the superblock to point at the new log
- switch to the new log
- restart log recovery on the new log.

> I'm not totally sure that works without prototyping some basic form of
> it, but it seems generally sane to me.

I think we could make it work, but it my gut tells me it is the
wrong approach.

> There are a lot of implementation
> details to work through as well of course, such as whether the kernel
> allocates the prospective log extent before the quiesce (and seeds it),
> whether to support the ability to allocate the new in-core log subsystem
> before tearing down the old (for more reliable failure handling and so
> it can be activated atomically), how to format changes into the new log,
> etc. etc., but that's far more detail than this is intended to cover...

*nod*

I'm not doubting it could be done, but I feel that it requires
far more high level filesystem structures to be manipulated in
intricate ways than I'm comfortable with.

My main reservation is that using inodes, BMBT and RMAP updates for
physically swapping the location of the log involves layers of the
filesystem that the log itself should not be interacting with. I'm
just not comfortable to tying low level physical log layout
manipulations to high level transactional filesystem metadata
modifications that rely on writing to the log for recovery.

We can - and IMO should - run the physical location swap the
entirely within the log code itself without involving transactions,
log items, the CIL, AIL, etc. The log is capable of writing it's own
records (example: log unmount records), hence we can implement the
physical swap in a recoverable manner at that level without
involving an high level transactions or log items at all.

If we separate the transactional allocation/freeing of extents from
the exchange of the log location, then we don't have to try to work
out how to run a recoverable transaction that creates a
discontinuity in the mechanism that provides transaction recovery
and is completely recoverable itself.

The high level filesystem operation can provide the log with a new
extent that is already owned by the log and will remain that way
through the quiesce process. With that, the log location swap
doesn't need any other interactions with the transaction subsystem -
We just need to be able to write to the old log directly, to
initialise the new log with the same information, and to write the
superblock directly. The log already writes the superblock directly
as part of log recovery, so doing it as part of a log swap isn't an
issue.

Then we can hand the old extent back to the high level code and it
can free it. Once it is freed, we can queisce the log again and the
new unmount record effectively cancels the log swap record at the
start of the log as it is no longer between the head and tail.

Job done, and at that point the filesystem can be resumed on the new
log.

If you want to work out the high level transactional allocation,
quiesce and freeing side of things, I can write the low level "log
change" record and recovery code taht swaps the location of the log.
I think if we separate the two sets of operations and isolate them
within the appropriate layers, everything becomes a lot easier and
it doesn't need to fall on just one person to do everything...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] Upgrade xfs filesystem to reflink support?
  2022-05-17  9:30               ` Dave Chinner
@ 2022-05-19 16:07                 ` Brian Foster
  2022-05-19 23:05                   ` Dave Chinner
  0 siblings, 1 reply; 13+ messages in thread
From: Brian Foster @ 2022-05-19 16:07 UTC (permalink / raw)
  To: Dave Chinner; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs, Theodore Tso

On Tue, May 17, 2022 at 07:30:41PM +1000, Dave Chinner wrote:
> On Mon, May 16, 2022 at 09:35:14AM -0400, Brian Foster wrote:
> > On Thu, May 12, 2022 at 08:24:01AM +1000, Dave Chinner wrote:
> > > On Wed, May 11, 2022 at 11:46:23AM -0400, Brian Foster wrote:
> > > > On Wed, May 11, 2022 at 08:05:23AM +1000, Dave Chinner wrote:
> > > > > On Tue, May 10, 2022 at 12:02:12PM -0700, Darrick J. Wong wrote:
> > > > > > On Tue, May 10, 2022 at 09:21:03AM +0300, Amir Goldstein wrote:
> > > > > > > On Mon, May 9, 2022 at 9:20 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > > > I think the upcoming nrext64 xfsprogs patches took in the first patch in
> > > > > > > > that series.
> > > > > > > >
> > > > > > > > Question: Now that mkfs has a min logsize of 64MB, should we refuse
> > > > > > > > upgrades for any filesystem with logsize < 64MB?
> > > > > > > 
> > > > > > > I think that would make a lot of sense. We do need to reduce the upgrade
> > > > > > > test matrix as much as we can, at least as a starting point.
> > > > > > > Our customers would have started with at least 1TB fs, so should not
> > > > > > > have a problem with minimum logsize on upgrade.
> > > > > > > 
> > > > > > > BTW, in LSFMM, Ted had a session about "Resize patterns" regarding the
> > > > > > > practice of users to start with a small fs and grow it, which is encouraged by
> > > > > > > Cloud providers pricing model.
> > > > > > > 
> > > > > > > I had asked Ted about the option to resize the ext4 journal and he replied
> > > > > > > that in theory it could be done, because the ext4 journal does not need to be
> > > > > > > contiguous. He thought that it was not the case for XFS though.
> > > > > > 
> > > > > > It's theoretically possible, but I'd bet that making it work reliably
> > > > > > will be difficult for an infrequent operation.  The old log would probably
> > > > > > have to clean itself, and then write a single transaction containing
> > > > > > both the bnobt update to allocate the new log as well as an EFI to erase
> > > > > > it.  Then you write to the new log a single transaction containing the
> > > > > > superblock and an EFI to free the old log.  Then you update the primary
> > > > > > super and force it out to disk, un-quiesce the log, and finish that EFI
> > > > > > so that the old log gets freed.
> > > > > > 
> > > > > > And then you have to go back and find the necessary parts that I missed.
> > > > > 
> > > > > The new log transaction to say "the new log is over there" so log
> > > > > recovery knows that the old log is being replaced and can go find
> > > > > the new log and recover it to free the old log.
> > > > > 
> > > > > IOWs, there's a heap of log recovery work needed, a new
> > > > > intent/transaction type, futzing with feature bits because old
> > > > > kernels won't be able to recovery such a operation, etc.
> > > > > 
> > > > > Then there's interesting issues that haven't ever been considered,
> > > > > like having a discontiguity in the LSN as we physically switch logs.
> > > > > What cycle number does the new log start at? What happens to all the
> > > > > head and tail tracking fields when we switch to the new log? What
> > > > > about all the log items in the AIL which is ordered by LSN? What
> > > > > about all the active log items that track a specific LSN for
> > > > > recovery integrity purposes (e.g. inode allocation buffers)? What
> > > > > about updating the reservation grant heads that track log space
> > > > > usage? Updating all the static size calculations used by the log
> > > > > code which has to be done before the new log can be written to via
> > > > > iclogs.
> > > > > 
> > ...
> > > 
> > > > TBH, if one were to go through the trouble of making the log resizeable,
> > > > I start to wonder whether it's worth starting with a format change that
> > > > better accommodates future flexibility. For example, the internal log is
> > > > already AG allocated space.. why not do something like assign it to an
> > > > internal log inode attached to the sb?  Then the log inode has the
> > > > obvious capability to allocate or free (non-active log) extents at
> > > > runtime through all the usual codepaths without disruption because the
> > > > log itself only cares about a target device, block offset and size. We
> > > > already know a bump of the log cycle count is sufficient for consistency
> > > > across a clean mount cycle because repair has been zapping clean logs by
> > > > default as such since pretty much forever.
> > > 
> > > Putting the log in an inode isn't needed for allocate/free of raw
> > > extents - we already do that with grow/shrink for the tail of an AG.
> > > Hence I'm not sure what virtually mapping the log actually gains us
> > > over just a raw extent pointed to by the superblock?
> > > 
> > 
> > I'm not sure what virtually mapping the log has to do with anything..?
> 
> That's what adding the inode BMBT indirection to point at the log
> blocks is - it's virtually mapping the physical block location.
> 
> > That seems like something to consider for the future if there's value to
> > it..
> 
> Virtual mapping as in "memory mapping the log" is a different
> problem altogether. We've talked about that several times in the
> context of DAX to avoid at least one level of memcpy() that the log
> writes currently do on PMEM systems.
> 

Ok.. we're getting a bit pedantic here. I realize that adding an extent
to an inode creates a virtual mapping, but nothing in this prospective
operation uses that mapping. The inode is used purely as a container to
hold a reference to an extent across a log quiesce.

I do think creating a virtually mapped log is an interesting future
consideration related to the aforementioned format change (i.e. hanging
a log inode off the sb), but that's a separate topic..

> > > > That potentially reduces log reallocation to a switchover algorithm that
> > > > could run at mount time. I.e., a new prospective log extent is allocated
> > > > at runtime (and maybe flagged with an xattr or something). The next
> > > > mount identifies a new/prospective log, requires/verifies that the old
> > > > log is clean, selects the new log extent (based on some currently
> > > > undefined selection algorithm) and seeds it with the appropriate cycle
> > > > count via synchronous transactions that release any currently inactive
> > > > extent(s) from the log inode. Any failure along the way sticks with the
> > > > old log and releases the still inactive new extent, if it happens to
> > > > exist. We already do this sort of stale resource clean up for other
> > > > things like unlinked inodes and stale COW blocks, so the general premise
> > > > exists.. hm?
> > > 
> > > That seems like just as much special case infrastructure as a custom
> > > log record type to run it online, and overall the algorithm isn't
> > > much different. And the online change doesn't require an on-disk
> > > format change...
> > > 
> > 
> > It seems a lot more simple to me, assuming it's viable, but then again I
> > don't see a clear enough description of what you're comparing against to
> > reason about it with certainty. I also don't think it necessarily
> > depends on a format change. I.e., perhaps an online variant could look
> > something like the following (pseudo algorithm):
> > 
> > 1. Userspace growfs feeds in a tmpfile or some such donor inode via
> > ioctl() that maps the new/prospective log as a data extent. Assume the
> > extent is ideally sized/allocated and disk content is properly formatted
> > as a valid log with an elevated cycle count for the purpose of this
> > sequence.
> 
> It would have to be a single contiguous extent and correctly
> aligned to sunit/swidth as the original log would have been.
> Otherwise stuff like log stripe unit IO optimisations get broken.
> 

Yeah, and this really could all happen on the kernel side rather than
userspace. I just wanted to keep the example simple and factor out as
many implementation details as possible to focus the discussion.

> > 2. Quiesce the log, freeze the fs, xfs_log_unmount() the active log and
> > xfs_log_mount() the new range based on the extent mapped by the donor
> > inode.
> > 
> > 3. Commit changes to the new log range to unmap the current extent (i.e.
> > the new log) from the donor inode and map the old log range, keeping the
> > in-core inode pinned/locked so it cannot be written back.
> 
> So a specialised swap extent operation? Currently we can swap
> individual extents between inodes (the rmap extent swap code), but I
> don't think we can't swap an inode extent with a bare, unreferenced
> "free space" extent right now.
> 

I suppose you could logically view it as a specialized extent swap, but
it's really just an inode unmap and map.

> > At this point
> > the only changes on disk are to the donor's data extent (still mapped by
> > the inode).
> 
> And potentially rmap btree updates because the type and owner of
> both extents change. There's also the need to mark the old log
> extent as unwritten, because if we crash iand recovery doesn't free
> the donor inode, we don't want to risk exposing the old contents of
> the log to userspace.
> 

rmap already refers to the active log, so we have to deal with rmap
updates regardless (if enabled).

The inode should be unlinked (or start unlinked as a tmpfile) before the
old log is quiesced such that the extent mapping is never accessible to
userspace. We could still mark the old log extent as unwritten if we
wanted to be extra safe, of course.

> Also, Because rmap updates are deferred, the changes will span
> multiple transactions via intents, so this will make use of the CIL
> and, potentially, commit to the log and insert items into the AIL.
> 
> If we get items in the AIL, we are open to metadata writeback
> starting while we are still doing the log changeover.  e.g. memory
> pressure occurs, direct reclaim causes xfs_reclaim_inodes_nr() to
> run which calls xfs_ail_push_all(mp->m_ail)....
> 

Good point. Committing a transaction and simply holding the inode lock
may not be sufficient to prevent writeback. I'm not sure this is so
difficult a problem to address, though...

The bmap map/unmap operations are also intent based dfops (i.e. used by
reflink remap). All the switchover really needs at this step is to put
something in the prospective log that ensures the mapping updates occur
in the event of a crash after the superblock update. All that requires
is getting the intents committed, which is already the initial substep
of transaction commit. IOW, I suspect all that may be necessary here is
to run the intent creation step of defer finish and to roll the
transaction. That would only log intents before the sb update and allow
the transaction to commit after the sb update.

Also note that the bmap deferred op completion already handles rmap
updates, if enabled, so there's nothing additional to do there wrt to
the inode. We would probably have to include rmapbt log record update
intents in the transaction, however.

> Hence if we end up with new metadata changes in place, we can't go
> back to the old log - we've got changes that are only sequenced in
> the new log, so we can only recover at this point from the new log.
> 

If we can just log the intents before updating the superblock, nothing
lands in the AIL. Since the operations are deferred by nature, the
in-core changes aren't even made until the new log is active.

> Yes, we can fix all these things, but we'll end up touching lots of
> high level code in subtle ways to make this all work, and I can't
> help but think this will end up being a bit fragile as a result.
> 

Perhaps, but I'd prefer to accurately pin down what may or may not be
involved before getting into that. I expect anything that implements a
log switch to be some degree of complex, invasive and potentially
fragile.

> > The superblock has not been updated, so any crash/recovery
> > will see the original log in the quiesced state, recover as normal and
> > process the donor file as an unlinked inode.
> > 
> > (The filesystem at this point has two recoverable logs. The original log
> > is unmodified since the quiesce and still referenced by the superblock.
> > It can be recovered in the event of a log switch failure. The new log
> > still resides in an inode extent mapping, but contains valid log records
> > to exchange its own extent mapping with the old log range).
> 
> Noted, but as per above, the potential for metadata writeback to
> occur before we've switch logs means we can't just go back to the
> old log. hence to make sure this works, we'll need new interlocks to
> ensure operations like metadata writeback can't occur while the log
> swap transactions are being performed.
> 
> > 4. Sync update the superblock to point at the new log range and unpin
> > the donor inode.
> > A crash after this point sees the new log in the sb and
> > recovers the changes to map the old log into the donor inode. The old
> > log range is returned to free space upon unlinked inode processing of
> > the donor inode.
> 
> Yeah, the moment we write the superblock, we have to be prepared to
> recover from the new log. That's why I think that both the old
> log and the new log should just have an identical "log change"
> record written to them after the unmount record. That simply records
> the location/size of both the old log and the new log and we can
> recover from that no matter which side of the superblock write we
> crash on.
> 
> i.e. we have the same change record information in both logs and the
> recovery action we need to perform only differs by the old log
> having to first:
> 
> - update the superblock to point at the new log
> - switch to the new log
> - restart log recovery on the new log.
> 

The prospect of a single "log termination" or "log chain/move" type
record sounds a lot more elegant to me than the initial train of thought
earlier in the thread of landing the alloc/free transactions to the old
log and the new log and having to validate that recovery works properly
at each step. I still don't have enough context to grok how the custom
record part provides recovery/consistency and simplifies the
implementation..

> > I'm not totally sure that works without prototyping some basic form of
> > it, but it seems generally sane to me.
> 
> I think we could make it work, but it my gut tells me it is the
> wrong approach.
> 

Fair enough. TBH, I'm not looking to argue or convince anybody about
what approach XFS should or should not take. I'm interested in fleshing
out the idea so viability can be assessed and the tradeoffs accurately
represented.

> > There are a lot of implementation
> > details to work through as well of course, such as whether the kernel
> > allocates the prospective log extent before the quiesce (and seeds it),
> > whether to support the ability to allocate the new in-core log subsystem
> > before tearing down the old (for more reliable failure handling and so
> > it can be activated atomically), how to format changes into the new log,
> > etc. etc., but that's far more detail than this is intended to cover...
> 
> *nod*
> 
> I'm not doubting it could be done, but I feel that it requires
> far more high level filesystem structures to be manipulated in
> intricate ways than I'm comfortable with.
> 
> My main reservation is that using inodes, BMBT and RMAP updates for
> physically swapping the location of the log involves layers of the
> filesystem that the log itself should not be interacting with. I'm
> just not comfortable to tying low level physical log layout
> manipulations to high level transactional filesystem metadata
> modifications that rely on writing to the log for recovery.
> 

As noted above, I'm not sure this really adds that much complexity. For
reference, an updated version of the sequence might look something like:

1. Allocate tmpfile and new log extent.

2. Quiesce old log, freeze fs, tear down old in-core log structures and
init the new.

3. Allocate a transaction (against the new log) to unmap the current
(new log) extent of the donor inode and map in the old log extent.
Create the intents and roll the transaction (but do not commit/finish).

4. Update the superblock to point at the new log.

5. Commit/finish the update transaction to perform the mapping changes.

The general idea is the same as before.. a crash before the sync update
tosses the prospective log on unlinked inode processing and a crash
after it processes the remapping intents and frees up the old log extent
in the same way.

I'm sure there are some caveats or warts buried in there (and obviously
it still involves a transaction), but if it fundamentally works it
actually may not be that much new code at all for the switchover piece.
The core functionality it depends on is the ability to run
xfs_defer_finish() in multiple steps. I think this is kind of how the
underlying mechanism works already, so may just involve some refactoring
to carry over pending dfops state across multiple calls.

> We can - and IMO should - run the physical location swap the
> entirely within the log code itself without involving transactions,
> log items, the CIL, AIL, etc. The log is capable of writing it's own
> records (example: log unmount records), hence we can implement the
> physical swap in a recoverable manner at that level without
> involving an high level transactions or log items at all.
> 
> If we separate the transactional allocation/freeing of extents from
> the exchange of the log location, then we don't have to try to work
> out how to run a recoverable transaction that creates a
> discontinuity in the mechanism that provides transaction recovery
> and is completely recoverable itself.
> 
> The high level filesystem operation can provide the log with a new
> extent that is already owned by the log and will remain that way
> through the quiesce process. With that, the log location swap
> doesn't need any other interactions with the transaction subsystem -
> We just need to be able to write to the old log directly, to
> initialise the new log with the same information, and to write the
> superblock directly. The log already writes the superblock directly
> as part of log recovery, so doing it as part of a log swap isn't an
> issue.
> 

I think I follow the approach you're describing of burying extent
ownership into the log itself with a custom/open-coded log record. I
conceptualize this as sort of an xfs_log_remount() mechanism from an old
range to a new one. It sounds sensible enough, but I'm not sure I follow
what the low level switchover sequence looks like, particularly how
crash recovery is handled. Could you run through the key steps that
provide the basic atomicity of the operation?

For example, I'll just take a stab and assume it goes something like:

1. Allocate the new log extent and commit it in a transaction with a
LOGCHANGE record that refers to the old range and new log range.

2. Do some sort of (partial?) quiesce/freeze that drains the AIL but
doesn't lose the opencoded LOGCHANGE intent record (and ensures it isn't
lost by covering etc.)..?? Tear down old log subsystem and init new one.

3. Format the new log appropriately (cycle, etc.) with a copy
(essentially a relog) of the LOGCHANGE intent. 

4. Sync update the superblock to point at the new log.

5. Free the old log extent via transaction that also includes a
LOGCHANGE intent done (to the new log).

So essentially a crash after step 2 recovers the original log and
restarts the log change op at the new log format step (since recovery
can't confirm whether it occurred) or a crash after step 4 recovers the
new log and frees the old based on whether the intent done record exists
or not.

Is that the right idea? If not, can you swizzle it into something that
is? If so, the part that is not so clear to me is what the magic looks
like to commit the extent alloc/frees with the open-coded log record(s)
without conflicting with quiesce (i.e. managing consistency with the
transactional part vs. the open coded record write part), or otherwise
managing consistency between the extent alloc and existence of the
custom record.

Brian

> Then we can hand the old extent back to the high level code and it
> can free it. Once it is freed, we can queisce the log again and the
> new unmount record effectively cancels the log swap record at the
> start of the log as it is no longer between the head and tail.
> 
> Job done, and at that point the filesystem can be resumed on the new
> log.
> 
> If you want to work out the high level transactional allocation,
> quiesce and freeing side of things, I can write the low level "log
> change" record and recovery code taht swaps the location of the log.
> I think if we separate the two sets of operations and isolate them
> within the appropriate layers, everything becomes a lot easier and
> it doesn't need to fall on just one person to do everything...
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] Upgrade xfs filesystem to reflink support?
  2022-05-19 16:07                 ` Brian Foster
@ 2022-05-19 23:05                   ` Dave Chinner
  0 siblings, 0 replies; 13+ messages in thread
From: Dave Chinner @ 2022-05-19 23:05 UTC (permalink / raw)
  To: Brian Foster; +Cc: Darrick J. Wong, Amir Goldstein, linux-xfs, Theodore Tso

On Thu, May 19, 2022 at 12:07:21PM -0400, Brian Foster wrote:
> On Tue, May 17, 2022 at 07:30:41PM +1000, Dave Chinner wrote:
> > On Mon, May 16, 2022 at 09:35:14AM -0400, Brian Foster wrote:
> > We can - and IMO should - run the physical location swap the
> > entirely within the log code itself without involving transactions,
> > log items, the CIL, AIL, etc. The log is capable of writing it's own
> > records (example: log unmount records), hence we can implement the
> > physical swap in a recoverable manner at that level without
> > involving an high level transactions or log items at all.
> > 
> > If we separate the transactional allocation/freeing of extents from
> > the exchange of the log location, then we don't have to try to work
> > out how to run a recoverable transaction that creates a
> > discontinuity in the mechanism that provides transaction recovery
> > and is completely recoverable itself.
> > 
> > The high level filesystem operation can provide the log with a new
> > extent that is already owned by the log and will remain that way
> > through the quiesce process. With that, the log location swap
> > doesn't need any other interactions with the transaction subsystem -
> > We just need to be able to write to the old log directly, to
> > initialise the new log with the same information, and to write the
> > superblock directly. The log already writes the superblock directly
> > as part of log recovery, so doing it as part of a log swap isn't an
> > issue.
> > 
> 
> I think I follow the approach you're describing of burying extent
> ownership into the log itself with a custom/open-coded log record. I
> conceptualize this as sort of an xfs_log_remount() mechanism from an old
> range to a new one. It sounds sensible enough, but I'm not sure I follow
> what the low level switchover sequence looks like, particularly how
> crash recovery is handled. Could you run through the key steps that
> provide the basic atomicity of the operation?
> 
> For example, I'll just take a stab and assume it goes something like:
> 
> 1. Allocate the new log extent and commit it in a transaction with a
> LOGCHANGE record that refers to the old range and new log range.
>
> 2. Do some sort of (partial?) quiesce/freeze that drains the AIL but
> doesn't lose the opencoded LOGCHANGE intent record (and ensures it isn't
> lost by covering etc.)..?? Tear down old log subsystem and init new one.
> 
> 3. Format the new log appropriately (cycle, etc.) with a copy
> (essentially a relog) of the LOGCHANGE intent. 
> 
> 4. Sync update the superblock to point at the new log.
> 
> 5. Free the old log extent via transaction that also includes a
> LOGCHANGE intent done (to the new log).

> So essentially a crash after step 2 recovers the original log and
> restarts the log change op at the new log format step (since recovery
> can't confirm whether it occurred) or a crash after step 4 recovers the
> new log and frees the old based on whether the intent done record exists
> or not.

Yup, that's pretty much the idea.

> Is that the right idea? If not, can you swizzle it into something that
> is? If so, the part that is not so clear to me is what the magic looks
> like to commit the extent alloc/frees with the open-coded log record(s)
> without conflicting with quiesce (i.e. managing consistency with the
> transactional part vs. the open coded record write part), or otherwise
> managing consistency between the extent alloc and existence of the
> custom record.

I was thinking of using CIL pinning to handle building the atomic
log change transactions that contain the necessary alloc/free
modifications.

In doing so, it means we can implement the LOG CHANGE record as a
different type of checkpoint transaction. The CIL flush just
includes more information in the struct xfs_trans_header it writes
at the start of the transaction to point at the old/new logs, and we
call that a XFS_TRANS_LOG_CHANGE_CHECKPOINT type instead of
XFS_TRANS_CHECKPOINT. Then we can recover the allocation
modifications in the checkpoint transaction using the existing code,
then perform the correct next recovery steps based on the current
superblock state.

With this trans header size change, old kernels will see a trans
header that is too long, so will refuse to recover at this point as
they think the log has been corrupted as they don't understand what
is in it. If we want to make the process more complex, we can
set a log incompat flag after the initial quiesce and clear it after
the final unmount record is written on completion.

Hence I think the overall process I think would be:

1. freeze the filesystem, quiesce the log. Write an unmount record
to place a hard recovery barrier in the log.

2. pin the CIL so it can't flush.

3. allocate an anonymous extent like we do in xfs_ag_shrink_space().
Then update the rmap to say it's owned by the log. These are
in-memory transactions at this point.

4. hand the extent to the log.

5. log formats the new extent.

6. log creates a new xlog instance pointing at the new log. This
completely duplicates all the log infrastructure (iclogs, etc).

7. log flushes the pinned CIL to current iclog with a LOG CHANGE
header record. iclog contents are duplicated into an iclog for the
new log, and modified for new log's cycle/block. New log iclog is
written first, then on completion the original iclog is written to
the old log.

8. update and write the superblock.

9. tear down the old xlog structure but don't free it.

10. memcpy() the new xlog over the top of the old one so we don't
have to update ever pointer to the log in the filesytem (e.g. every
log item attached to inodes that were once dirty, etc). This also
updates grant heads, log head/tail, etc to match the new log.

11. free the new xlog stucture without tearing it down.

12. push the CIL log items into the AIL with commit LSNs from the
new log.

13. hand the old log extent back to the caller that provided the new
log extent.

14. run transactions to free the old extent and run all the
associated deferops (e.g. rmap updates) that are needed.

15. Unpin the CIL and return to quiesced log state. This forces the
CIL to disk with a LOG CHANGE header record indicating completion of
the log change. Quiesce then writes all the metadata and updates the
tail with an unmount record, providing a hard recovery barrier in
the log.

16. unfreeze the filesystem with the new log active.

I suspect that some of the ordering in and around the setup and
changeover of the two logs will be needed - it's just a rough design
at this point, but that should give you an idea of what I was
thinking.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] Upgrade xfs filesystem to reflink support?
  2022-05-09 18:20 ` Darrick J. Wong
  2022-05-10  6:21   ` Amir Goldstein
@ 2022-05-10  9:29   ` Christoph Hellwig
  2022-05-10 18:54     ` Darrick J. Wong
  1 sibling, 1 reply; 13+ messages in thread
From: Christoph Hellwig @ 2022-05-10  9:29 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: Amir Goldstein, Dave Chinner, linux-xfs

On Mon, May 09, 2022 at 11:20:43AM -0700, Darrick J. Wong wrote:
> > Is there any a priori NACK or exceptional challenges w.r.t implementing
> > upgrade of xfs to reflink support?
> 
> No, just lack of immediate user demand + time to develop and merge code
> + time to QA the whole mess to make sure it doesn't introduce any
> messes.

I also have vague memory I did an in-kernel online upgrade back in the
day but we decided to not bother in the end.  Maybe some list archive
search could find it.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [QUESTION] Upgrade xfs filesystem to reflink support?
  2022-05-10  9:29   ` Christoph Hellwig
@ 2022-05-10 18:54     ` Darrick J. Wong
  0 siblings, 0 replies; 13+ messages in thread
From: Darrick J. Wong @ 2022-05-10 18:54 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Amir Goldstein, Dave Chinner, linux-xfs

On Tue, May 10, 2022 at 02:29:35AM -0700, Christoph Hellwig wrote:
> On Mon, May 09, 2022 at 11:20:43AM -0700, Darrick J. Wong wrote:
> > > Is there any a priori NACK or exceptional challenges w.r.t implementing
> > > upgrade of xfs to reflink support?
> > 
> > No, just lack of immediate user demand + time to develop and merge code
> > + time to QA the whole mess to make sure it doesn't introduce any
> > messes.
> 
> I also have vague memory I did an in-kernel online upgrade back in the
> day but we decided to not bother in the end.  Maybe some list archive
> search could find it.

https://lore.kernel.org/linux-xfs/1464877150-20457-1-git-send-email-hch@lst.de/

Note that your version did live upgrades of mounted filesystems
whereas the one I wrote does it as part of xfs_repair, which means we
have a somewhat better guarantee that the fs will be in decent shape
afterwards.

--D

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-05-19 23:05 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-05-09  7:50 [QUESTION] Upgrade xfs filesystem to reflink support? Amir Goldstein
2022-05-09 18:20 ` Darrick J. Wong
2022-05-10  6:21   ` Amir Goldstein
2022-05-10 19:02     ` Darrick J. Wong
2022-05-10 22:05       ` Dave Chinner
2022-05-11 15:46         ` Brian Foster
2022-05-11 22:24           ` Dave Chinner
2022-05-16 13:35             ` Brian Foster
2022-05-17  9:30               ` Dave Chinner
2022-05-19 16:07                 ` Brian Foster
2022-05-19 23:05                   ` Dave Chinner
2022-05-10  9:29   ` Christoph Hellwig
2022-05-10 18:54     ` Darrick J. Wong

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox