* Question: reserve log space at IO time for recover
@ 2023-07-18 22:57 Wengang Wang
  2023-07-19  0:11 ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread

From: Wengang Wang @ 2023-07-18 22:57 UTC (permalink / raw)
To: linux-xfs@vger.kernel.org; +Cc: Srikanth C S

Hi,

I have an XFS metadump (it was running with 4.14.35 plus some back-ported
patches); mounting it (log recovery) hangs at log space reservation. There are
181760 bytes of on-disk free journal space, while the transaction needs to
reserve 360416 bytes to start the recovery. Thus the mount hangs forever. That
happens with the 4.14.35 kernel and also with the upstream kernel (6.4.0).

This is the related stack dump (6.4.0 kernel):

[<0>] xlog_grant_head_wait+0xbd/0x200 [xfs]
[<0>] xlog_grant_head_check+0xd9/0x100 [xfs]
[<0>] xfs_log_reserve+0xbc/0x1e0 [xfs]
[<0>] xfs_trans_reserve+0x138/0x170 [xfs]
[<0>] xfs_trans_alloc+0xe8/0x220 [xfs]
[<0>] xfs_efi_item_recover+0x110/0x250 [xfs]
[<0>] xlog_recover_process_intents.isra.28+0xba/0x2d0 [xfs]
[<0>] xlog_recover_finish+0x33/0x310 [xfs]
[<0>] xfs_log_mount_finish+0xdb/0x160 [xfs]
[<0>] xfs_mountfs+0x51c/0x900 [xfs]
[<0>] xfs_fs_fill_super+0x4b8/0x940 [xfs]
[<0>] get_tree_bdev+0x193/0x280
[<0>] vfs_get_tree+0x26/0xd0
[<0>] path_mount+0x69d/0x9b0
[<0>] do_mount+0x7d/0xa0
[<0>] __x64_sys_mount+0xdc/0x100
[<0>] do_syscall_64+0x3b/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Thus we can say the 4.14.35 kernel didn’t reserve log space at IO time to make
log recovery safe. The upstream kernel doesn’t do that either, if I read the
source code right (I might be wrong).

So shall we reserve a proper amount of log space at IO time, call it
Unflush-Reserve (UR), to ensure log recovery is safe? The amount of UR is
determined by the currently unflushed log items. It gets increased just after a
transaction is committed and gets decreased when the log items are flushed. With
the UR, we are safe to have enough log space for the transactions used by log
recovery.
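To make the proposal concrete, here is a minimal user-space sketch of the UR accounting. All names here are hypothetical and the structure is purely a model of the idea as described above, not XFS code:

```c
/*
 * Model of the proposed "Unflush-Reserve" (UR) accounting.
 * Hypothetical names; a sketch of the idea, not XFS code.
 */
struct ur_log {
	long	free_bytes;	/* free journal space */
	long	ur_bytes;	/* held back for not-yet-flushed log items */
};

/* Transaction committed: hold back the space recovery would need for its items. */
static void ur_commit(struct ur_log *log, long item_bytes)
{
	log->ur_bytes += item_bytes;
}

/* Log items flushed to their final on-disk location: release the hold. */
static void ur_flush(struct ur_log *log, long item_bytes)
{
	log->ur_bytes -= item_bytes;
}

/* A new reservation may only consume space that is not held back as UR. */
static int ur_can_reserve(const struct ur_log *log, long need)
{
	return log->free_bytes - log->ur_bytes >= need;
}
```

With such a hold in place, runtime transactions would start waiting before the free journal space could drop below what recovery of the outstanding unflushed items needs.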
thanks, wengang ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-18 22:57 Question: reserve log space at IO time for recover Wengang Wang @ 2023-07-19 0:11 ` Dave Chinner 2023-07-19 1:44 ` Darrick J. Wong 2023-07-19 1:46 ` Wengang Wang 0 siblings, 2 replies; 22+ messages in thread From: Dave Chinner @ 2023-07-19 0:11 UTC (permalink / raw) To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org, Srikanth C S On Tue, Jul 18, 2023 at 10:57:38PM +0000, Wengang Wang wrote: > Hi, > > I have a XFS metadump (was running with 4.14.35 plussing some back ported patches), > mounting it (log recover) hang at log space reservation. There is 181760 bytes on-disk > free journal space, while the transaction needs to reserve 360416 bytes to start the recovery. > Thus the mount hangs for ever. Most likely something went wrong at runtime on the 4.14.35 kernel prior to the crash, leaving the on-disk state in an impossible to recover state. Likely an accounting leak in a transaction reservation somewhere, likely in passing the space used from the transaction to the CIL. We've had bugs in this area before, they eventually manifest in log hangs like this either at runtime or during recovery... > That happens with 4.14.35 kernel and also upstream > kernel (6.4.0). 
Upgrading the kernel won't fix recovery - it is likely that the journal state
on disk is invalid and so the mount cannot complete.

> The is the related stack dumping (6.4.0 kernel):
> 
> [<0>] xlog_grant_head_wait+0xbd/0x200 [xfs]
> [<0>] xlog_grant_head_check+0xd9/0x100 [xfs]
> [<0>] xfs_log_reserve+0xbc/0x1e0 [xfs]
> [<0>] xfs_trans_reserve+0x138/0x170 [xfs]
> [<0>] xfs_trans_alloc+0xe8/0x220 [xfs]
> [<0>] xfs_efi_item_recover+0x110/0x250 [xfs]
> [<0>] xlog_recover_process_intents.isra.28+0xba/0x2d0 [xfs]
> [<0>] xlog_recover_finish+0x33/0x310 [xfs]
> [<0>] xfs_log_mount_finish+0xdb/0x160 [xfs]
> [<0>] xfs_mountfs+0x51c/0x900 [xfs]
> [<0>] xfs_fs_fill_super+0x4b8/0x940 [xfs]
> [<0>] get_tree_bdev+0x193/0x280
> [<0>] vfs_get_tree+0x26/0xd0
> [<0>] path_mount+0x69d/0x9b0
> [<0>] do_mount+0x7d/0xa0
> [<0>] __x64_sys_mount+0xdc/0x100
> [<0>] do_syscall_64+0x3b/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> 
> Thus we can say 4.14.35 kernel didn’t reserve log space at IO time to make log recover
> safe. Upstream kernel doesn’t do that either if I read the source code right (I might be wrong).

Sure they do.

Log space usage is what the grant heads track; transactions are not allowed to
start if there isn't both reserve and write grant head space available for
them, and transaction rolls get held until there is write grant space
available for them (i.e. they can block in xfs_trans_roll() ->
xfs_trans_reserve() waiting for write grant head space).

There have been bugs in the grant head accounting mechanisms in the past, and
there may well still be bugs in them. But it is the grant head mechanism that
is supposed to guarantee there is always space in the journal for a
transaction to commit and, by extension, to ensure that we always have space
in the journal for a transaction to be fully recovered.

> So shall we reserve proper amount of log space at IO time, call it Unflush-Reserve, to
> ensure log recovery safe? The number of UR is determined by current un flushed log items.
> It gets increased just after transaction is committed and gets decreased when log items are
> flushed. With the UR, we are safe to have enough log space for the transactions used by log
> recovery.

The grant heads already track log space usage and reservations like this. If
you want to learn more about the nitty-gritty details, look at this patch set
that is aimed at changing how the grant heads track the used/reserved log
space to improve performance:

https://lore.kernel.org/linux-xfs/20221220232308.3482960-1-david@fromorbit.com/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-19 0:11 ` Dave Chinner @ 2023-07-19 1:44 ` Darrick J. Wong 2023-07-19 6:25 ` Dave Chinner 2023-07-19 1:46 ` Wengang Wang 1 sibling, 1 reply; 22+ messages in thread From: Darrick J. Wong @ 2023-07-19 1:44 UTC (permalink / raw) To: Dave Chinner; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Wed, Jul 19, 2023 at 10:11:03AM +1000, Dave Chinner wrote: > On Tue, Jul 18, 2023 at 10:57:38PM +0000, Wengang Wang wrote: > > Hi, > > > > I have a XFS metadump (was running with 4.14.35 plussing some back ported patches), > > mounting it (log recover) hang at log space reservation. There is 181760 bytes on-disk > > free journal space, while the transaction needs to reserve 360416 bytes to start the recovery. > > Thus the mount hangs for ever. > > Most likely something went wrong at runtime on the 4.14.35 kernel > prior to the crash, leaving the on-disk state in an impossible to > recover state. Likely an accounting leak in a transaction > reservation somewhere, likely in passing the space used from the > transaction to the CIL. We've had bugs in this area before, they > eventually manifest in log hangs like this either at runtime or > during recovery... > > > That happens with 4.14.35 kernel and also upstream > > kernel (6.4.0). > > Upgrading the kernel won't fix recovery - it is likely that the > journal state on disk is invalid and so the mount cannot complete Hmm. It'd be nice to know what the kernel thought it was doing when it went down. /me wonders if this has anything to do with the EFI recovery creating a transaction with tr_itruncate reservation because the log itself doesn't record the reservations of the active transactions. <begin handwaving> Let's say you have a 1000K log, a tr_write reservation is 100k, and a tr_itruncate reservations are 300k. 
In this case, you could theoretically have 10x tr_write transactions running concurrently; or you could have 3x tr_itruncate transactions running concurrently. Now let's say that someone fires up 10 programs that try to fpunch 10 separate files. Those ten threads will consume all the log grant space, unmap a block, and log an EFI. I think in reality tr_logcount means that 5 threads each consume (2*100k) grant space, but the point here is that we've used up all the log grant space. Then crash the system, having committed the first transaction of the two-transaction chain. Upon recovery, we'll find the 10x unfinished EFIs and pass them to EFI recovery. However, recovery creates a separate tr_itruncate transaction to finish each EFI. Now do we have a problem because the required log grant space is 300k * 10 = 3000k? It's late and I don't remember how recovery for non-intent items works quite well enough to think that scenario adds up. Maybe it was the case that before the system went down, the log had used 800K of the grant space for logged buffers and 100K for a single EFI logged in a tr_write transaction. Then we crashed, reloaded the 800K of stuff, and now we're trying to allocate 300K for a tr_itruncate to restart the EFI, but there's not enough log grant space? 
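A quick check of the arithmetic in the handwave above. The 1000K/100k/300k figures are only the example numbers from this message, nothing measured:

```c
/* Example numbers from the handwave above, in KiB. Illustrative only. */
enum {
	LOG_SIZE     = 1000,	/* journal size */
	TR_WRITE     = 100,	/* tr_write reservation */
	TR_ITRUNCATE = 300,	/* tr_itruncate reservation */
};

/* How many transactions of a given unit size can hold grant space at once? */
static int max_concurrent(int unit_res)
{
	return LOG_SIZE / unit_res;
}

/* Grant space EFI recovery would need if all N reservations were held at once. */
static int recovery_demand(int nintents)
{
	return nintents * TR_ITRUNCATE;
}
```

max_concurrent(TR_WRITE) is 10 and max_concurrent(TR_ITRUNCATE) is 3, while recovery_demand(10) is 3000K - three times the log - which is exactly the apparent overcommit being questioned here.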
<muggawuggamuggawugga> --D > > The is the related stack dumping (6.4.0 kernel): > > > > [<0>] xlog_grant_head_wait+0xbd/0x200 [xfs] > > [<0>] xlog_grant_head_check+0xd9/0x100 [xfs] > > [<0>] xfs_log_reserve+0xbc/0x1e0 [xfs] > > [<0>] xfs_trans_reserve+0x138/0x170 [xfs] > > [<0>] xfs_trans_alloc+0xe8/0x220 [xfs] > > [<0>] xfs_efi_item_recover+0x110/0x250 [xfs] > > [<0>] xlog_recover_process_intents.isra.28+0xba/0x2d0 [xfs] > > [<0>] xlog_recover_finish+0x33/0x310 [xfs] > > [<0>] xfs_log_mount_finish+0xdb/0x160 [xfs] > > [<0>] xfs_mountfs+0x51c/0x900 [xfs] > > [<0>] xfs_fs_fill_super+0x4b8/0x940 [xfs] > > [<0>] get_tree_bdev+0x193/0x280 > > [<0>] vfs_get_tree+0x26/0xd0 > > [<0>] path_mount+0x69d/0x9b0 > > [<0>] do_mount+0x7d/0xa0 > > [<0>] __x64_sys_mount+0xdc/0x100 > > [<0>] do_syscall_64+0x3b/0x90 > > [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 > > > > Thus we can say 4.14.35 kernel didn’t reserve log space at IO time to make log recover > > safe. Upstream kernel doesn’t do that either if I read the source code right (I might be wrong). > > Sure they do. > > Log space usage is what the grant heads track; transactions are not > allowed to start if there isn't both reserve and write grant head > space available for them, and transaction rolls get held until there > is write grant space available for them (i.e. they can block in > xfs_trans_roll() -> xfs_trans_reserve() waiting for write grant head > space). > > There have been bugs in the grant head accounting mechanisms in the > past, there may well still be bugs in it. But it is the grant head > mechanisms that is supposed to guarantee there is always space in > the journal for a transaction to commit, and by extension, ensure > that we always have space in the journal for a transaction to be > fully recovered. > > > So shall we reserve proper amount of log space at IO time, call it Unflush-Reserve, to > > ensure log recovery safe? The number of UR is determined by current un flushed log items. 
> > It gets increased just after transaction is committed and gets decreased when log items are > > flushed. With the UR, we are safe to have enough log space for the transactions used by log > > recovery. > > The grant heads already track log space usage and reservations like > this. If you want to learn more about the nitty gritty details, look > at this patch set that is aimed at changing how the grant heads > track the used/reserved log space to improve performance: > > https://lore.kernel.org/linux-xfs/20221220232308.3482960-1-david@fromorbit.com/ > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-19 1:44 ` Darrick J. Wong @ 2023-07-19 6:25 ` Dave Chinner 2023-07-21 19:36 ` Wengang Wang 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2023-07-19 6:25 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Tue, Jul 18, 2023 at 06:44:13PM -0700, Darrick J. Wong wrote: > On Wed, Jul 19, 2023 at 10:11:03AM +1000, Dave Chinner wrote: > > On Tue, Jul 18, 2023 at 10:57:38PM +0000, Wengang Wang wrote: > > > Hi, > > > > > > I have a XFS metadump (was running with 4.14.35 plussing some back ported patches), > > > mounting it (log recover) hang at log space reservation. There is 181760 bytes on-disk > > > free journal space, while the transaction needs to reserve 360416 bytes to start the recovery. > > > Thus the mount hangs for ever. > > > > Most likely something went wrong at runtime on the 4.14.35 kernel > > prior to the crash, leaving the on-disk state in an impossible to > > recover state. Likely an accounting leak in a transaction > > reservation somewhere, likely in passing the space used from the > > transaction to the CIL. We've had bugs in this area before, they > > eventually manifest in log hangs like this either at runtime or > > during recovery... > > > > > That happens with 4.14.35 kernel and also upstream > > > kernel (6.4.0). > > > > Upgrading the kernel won't fix recovery - it is likely that the > > journal state on disk is invalid and so the mount cannot complete > > Hmm. It'd be nice to know what the kernel thought it was doing when it > went down. > > /me wonders if this has anything to do with the EFI recovery creating a > transaction with tr_itruncate reservation because the log itself doesn't > record the reservations of the active transactions. Possibly - it's been that way since 1994 but I don't recall it ever causing any issues in the past. 
That's not to say it's correct - I think it's wrong, but I think the whole transaction reservation calculation infrastructure needs a complete overhaul.... > > <begin handwaving> > > Let's say you have a 1000K log, a tr_write reservation is 100k, and a > tr_itruncate reservations are 300k. In this case, you could > theoretically have 10x tr_write transactions running concurrently; or > you could have 3x tr_itruncate transactions running concurrently. > > Now let's say that someone fires up 10 programs that try to fpunch 10 > separate files. Those ten threads will consume all the log grant space, > unmap a block, and log an EFI. I think in reality tr_logcount means > that 5 threads each consume (2*100k) grant space, but the point here is > that we've used up all the log grant space. > > Then crash the system, having committed the first transaction of the > two-transaction chain. > > Upon recovery, we'll find the 10x unfinished EFIs and pass them to EFI > recovery. However, recovery creates a separate tr_itruncate transaction > to finish each EFI. Now do we have a problem because the required log > grant space is 300k * 10 = 3000k? Hmmmm. That smells wrong. Can't put my finger on it ..... .... ah. Yeah. That. We only run one transaction at a time, and we commit the transaction after logging new intents and capturing the work that remains. So we return the unused part of the reservation (most of it) back to the log before we try to recover the next intent in the AIL. Hence we don't need (300k * 10) in the log to recover these EFIs as we don't hold all ten reservations at the same time (as we would have at runtime) - we need (log space used by recovery of intents + one reservation) to recover them all. Once we've replayed all the intents from the AIL and converted them into newly captured intents, they are removed from the AIL and that moves the tail of the log forwards. 
This frees up the entire log, and we then run the captured intents that still
need to be processed. We run them one at a time to completion, committing them
as we go, so again we only need space in the log for a single transaction
reservation to complete recovery of the intent chains.

IOWs, because recovery of intents is single threaded, we only need to preserve
space in the log for a single reservation to make forwards progress.

> It's late and I don't remember how recovery for non-intent items works
> quite well enough to think that scenario adds up. Maybe it was the case
> that before the system went down, the log had used 800K of the grant
> space for logged buffers and 100K for a single EFI logged in a tr_write
> transaction. Then we crashed, reloaded the 800K of stuff, and now we're
> trying to allocate 300K for a tr_itruncate to restart the EFI, but
> there's not enough log grant space?

Possible, if the EFI pins the tail of the log and the rest of the log is full.
I've never seen that happen, and we've been using itrunc reservations in
recovery since 1994, but that doesn't mean it can't happen.

FWIW, I don't think this is EFI specific. BUI recovery uses itrunc
reservations, but what if it was an unlink operation using a remove
reservation to free a directory block that logs a BUI? Same problem,
different vector, right?

I suspect what we really need is for all the intent processing to be
restartable. We already have that for RUIs, to relog/restart them when there
isn't enough reservation space available to complete processing the current
RUI. We just made EFIs restartable to avoid busy extent deadlocks; we can
easily extend that with the same "enough reservation available" check as we
use for RUIs, etc. Do the same for BUIs and RUIs, and then...

... we can set up a reservation calculation for each intent type, and the
reservation needed for a given chain of operations is the max of all the
steps in the chain. Hence if we get part way through a chain and run out of
reservation, we can restart the chain with the reservation we know is large
enough to complete the remaining part of the chain.

The new reservation may be smaller than the reservation that was held when we
started the intent processing (because it's a rolling chain with an inherited
log ticket), but this guarantees that we can reduce the reservation to the
minimum required at any point in time if we are running low on log space....

This also gets around the problem of having to reserve enough space for N
operations (e.g. 4 extents in an EFI) when the vast majority only use and
need space for 1 operation. If we get an EFI with N extents in it, we can try
a reservation for (N * efi_res), and if we can't get that we could just use
efi_res and work through the EFI one extent at a time....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover
  2023-07-19  6:25     ` Dave Chinner
@ 2023-07-21 19:36       ` Wengang Wang
  2023-07-24  0:57         ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread

From: Wengang Wang @ 2023-07-21 19:36 UTC (permalink / raw)
To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S

FYI:

I am able to reproduce the XFS mount hang issue with hacked kernels based on
both the 4.14.35 kernel and the 6.4.0 kernel.

The hacking patch is like this (based on the 6.4.0 kernel):

diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index f9e36b810663..e04e8738c564 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -491,6 +491,12 @@ xfs_extent_free_finish_item(
 	struct xfs_extent_free_item	*xefi;
 	int				error;
 
+	if (strcmp(tp->t_mountp->m_super->s_id,"loop0") == 0) {
+		pr_err("loop0 free extent enter sleeping 5 hours\n");
+		msleep(3600*1000*5);
+		pr_err("loop0 free extent end sleeping 5 hours\n");
+	}
+
 	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
 
 	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 7d4109af193e..b571fe6966af 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -557,6 +557,13 @@ xfsaild_push(
 	xfs_trans_ail_cursor_done(&cur);
 	spin_unlock(&ailp->ail_lock);
 
+	if (strcmp(ailp->ail_log->l_mp->m_super->s_id,"loop0") == 0) {
+		pr_err("loop0 aild enter sleeping 5 hours\n");
+		msleep(3600*1000*5); // sleep 5 hours
+		pr_err("loop0 aild end sleeping 5 hours\n");
+		pr_err("loop0 start issuing buffer IO\n");
+	}
+
 	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
 		ailp->ail_log_flush++;

The hacking patch does two things:
1. sleeps 5 hours in the extent freeing deferred op before reserving log
   space. That simulates very slow extent free processing.
2. sleeps 5 hours before flushing metadata buffers. That simulates very slow
   AIL processing.
Though the hacking patch sleeps for 5 hours, it actually needs only 10 minutes
to reproduce this issue.

Reproduce steps:

1. Create an XFS with a 10MiB log size (small, so it is easier to reproduce).
   The following steps all aim at this XFS volume.

2. With the un-hacked kernel, create 100K files, 1 block per file, under a
   dedicated directory. We will overwrite those files, causing inode timestamp
   logging. A timestamp logging transaction (tr_fsyncts) requires/reserves 2508
   bytes, which is less than the 360416 bytes required (tr_itrunc) on the
   4.14.35 kernel. I was not able to get the exact number for 6.4.0, but it
   should be similar. With those timestamp updates, we are able to make the
   free log size less than 360416 bytes.

3. With the hacked kernel, create a new file, allocate block(s) to it, and
   remove it. Because of the hacking patch, an EFI will be logged but the EFD
   will not be for 5 hours. We will panic the kernel within those 5 hours so
   that on the next mount, EFI recovery will happen.

4. On another console, serially overwrite the 100K files we created in step 2.
   This gets stuck after overwriting 57K+ files. That's because we used up
   (almost) all the free log space and we pinned the log tail position with
   the hacking patch. Checking the vmcore, it is stuck waiting for 2508 log
   bytes (1552 free bytes available). The numbers are exactly the same on the
   4.14.35 and 6.4.0 kernels.

5. Checking the on-disk free log space, it’s 181760 bytes for both the 4.14.35
   and 6.4.0 kernels. That is not the same as the free log space seen in the
   vmcore, 1552 bytes. I am not able to figure out why they are different yet.
   Does the delta aim to make log recovery safe?

6. Panic the kernel manually and reboot to the un-hacked kernel. Verify the
   on-disk free log space; it's the same as seen in step 5.

7. With the un-hacked kernel, mount that XFS volume; it hangs with both the
   4.14.35 and 6.4.0 kernels. Checking the vmcore, the free log bytes are now
   the same as the on-disk free log bytes we found in step 6. That's for the
   4.14.35 kernel.
I am not able to check the numbers for the 6.4.0 kernel, but I think they are
the same as seen on the 4.14.35 kernel. So reproduced!!

Checking back on the hacking patch, I am simulating very slow buffer flushing
and extent freeing. I'd think it's possible that the same things can get slow
somehow in real production cases, so the hacking patch is reasonable. When the
free log bytes run out and a kernel panic or manual panic/reboot (like
echo c/b >/proc/sysrq-trigger) happens, it's possible that even the 6.4.0
kernel can run into the same situation as seen on the 4.14.35 kernel. So the
6.4.0 kernel needs a fix too.

I will look more into this.

thanks,
wengang

> On Jul 18, 2023, at 11:25 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 18, 2023 at 06:44:13PM -0700, Darrick J. Wong wrote:
>> On Wed, Jul 19, 2023 at 10:11:03AM +1000, Dave Chinner wrote:
>>> On Tue, Jul 18, 2023 at 10:57:38PM +0000, Wengang Wang wrote:
>>>> Hi,
>>>> 
>>>> I have a XFS metadump (was running with 4.14.35 plussing some back ported patches),
>>>> mounting it (log recover) hang at log space reservation. There is 181760 bytes on-disk
>>>> free journal space, while the transaction needs to reserve 360416 bytes to start the recovery.
>>>> Thus the mount hangs for ever.
>>> 
>>> Most likely something went wrong at runtime on the 4.14.35 kernel
>>> prior to the crash, leaving the on-disk state in an impossible to
>>> recover state. Likely an accounting leak in a transaction
>>> reservation somewhere, likely in passing the space used from the
>>> transaction to the CIL. We've had bugs in this area before, they
>>> eventually manifest in log hangs like this either at runtime or
>>> during recovery...
>>> 
>>>> That happens with 4.14.35 kernel and also upstream
>>>> kernel (6.4.0).
>>> 
>>> Upgrading the kernel won't fix recovery - it is likely that the
>>> journal state on disk is invalid and so the mount cannot complete
>> 
>> Hmm. It'd be nice to know what the kernel thought it was doing when it
>> went down.
>> >> /me wonders if this has anything to do with the EFI recovery creating a >> transaction with tr_itruncate reservation because the log itself doesn't >> record the reservations of the active transactions. > > Possibly - it's been that way since 1994 but I don't recall it ever > causing any issues in the past. That's not to say it's correct - I > think it's wrong, but I think the whole transaction reservation > calculation infrastructure needs a complete overhaul.... > >> >> <begin handwaving> >> >> Let's say you have a 1000K log, a tr_write reservation is 100k, and a >> tr_itruncate reservations are 300k. In this case, you could >> theoretically have 10x tr_write transactions running concurrently; or >> you could have 3x tr_itruncate transactions running concurrently. >> >> Now let's say that someone fires up 10 programs that try to fpunch 10 >> separate files. Those ten threads will consume all the log grant space, >> unmap a block, and log an EFI. I think in reality tr_logcount means >> that 5 threads each consume (2*100k) grant space, but the point here is >> that we've used up all the log grant space. >> >> Then crash the system, having committed the first transaction of the >> two-transaction chain. >> >> Upon recovery, we'll find the 10x unfinished EFIs and pass them to EFI >> recovery. However, recovery creates a separate tr_itruncate transaction >> to finish each EFI. Now do we have a problem because the required log >> grant space is 300k * 10 = 3000k? > > Hmmmm. That smells wrong. Can't put my finger on it ..... > > > .... ah. Yeah. That. > > We only run one transaction at a time, and we commit the transaction > after logging new intents and capturing the work that remains. So we > return the unused part of the reservation (most of it) back to the > log before we try to recover the next intent in the AIL. 
> > Hence we don't need (300k * 10) in the log to recover these EFIs as > we don't hold all ten reservations at the same time (as we would > have at runtime) - we need (log space used by recovery of intents + > one reservation) to recover them all. > > Once we've replayed all the intents from the AIL and converted them > into newly captured intents, they are removed from the AIL and that > moves the tail of the log forwards. This frees up the entire of the > log, and we then run the captured intents that still need to be > processed. We run them one at a time to completion, committing them > as we go, so again we only need space in the log for a single > transaction reservation to complete recovery of the intent chaings. > > IOWs, because recovery of intents is single threaded, we only need > to preserve space in the log for a single reservation to make > forwards progress. > >> It's late and I don't remember how recovery for non-intent items works >> quite well enough to think that scenario adds up. Maybe it was the case >> that before the system went down, the log had used 800K of the grant >> space for logged buffers and 100K for a single EFI logged in a tr_write >> transaction. Then we crashed, reloaded the 800K of stuff, and now we're >> trying to allocate 300K for a tr_itruncate to restart the EFI, but >> there's not enough log grant space? > > Possible, if the EFI pins the tail of the log and the rest of the > log is full. I've never seen that happen, and we've been using > itrunc reservations in recovery since 1994, but that doesn't mean it > can't happen. > > FWIW, I don't think this is EFI specific. BUI recovery use itrunc > reservations, but what if it was an unlink operation using a remove > reservation to free a directory block that logs a BUI? Same problem, > different vector, right? > > I suspect what we really need is for all the intent processing to be > restartable. 
We already have those for RUIs to relog/restart them > there isn't enough reservation space available to complete > processing the current RUI. We just made EFIs restartable to avoid > busy extent deadlocks, we can easily extend that the same "enough > reservation available" as we use for RUIs, etc. Do the same for BUIs > and RUIs, and then... > > ... we can set up a reservation calculation for each intent type, > and the reservation needed for a given chain of operations is the > max of all the steps in the chain. Hence if we get part way through > a chain and run out of reservation, we can restart the chain the > the reservation we know is large enough to complete the remaining > part of the chain. > > The new reservation may be smaller than the reservation that was > held when we start the intent processing (because it's a rolling > chain with an inherited log ticket), but this guarantees that we can > reduce the reservation to the minimum required at any point in > time if we are running low on log space.... > > This also gets around the problem of having to reserve enough space > for N operations (e.g. 4 extents in an EFI) when the vast majority > only use and need space for 1 operation. If we get an EFI with N > extents in it, we can try a reservation for (N * efi_res) and if we > can't get that we could just use efi_res and work through the EFI > one extent at a time.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-21 19:36 ` Wengang Wang @ 2023-07-24 0:57 ` Dave Chinner 2023-07-24 18:03 ` Wengang Wang 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2023-07-24 0:57 UTC (permalink / raw) To: Wengang Wang; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: > FYI: > > I am able reproduce the XFS mount hang issue with hacked kernels based on > both 4.14.35 kernel or 6.4.0 kernel. > > The hacking patch is like this (based on 6.4.0 kernel): > > diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c > index f9e36b810663..e04e8738c564 100644 > --- a/fs/xfs/xfs_extfree_item.c > +++ b/fs/xfs/xfs_extfree_item.c > @@ -491,6 +491,12 @@ xfs_extent_free_finish_item( > struct xfs_extent_free_item *xefi; > int error; > > + if (strcmp(tp->t_mountp->m_super->s_id,"loop0") == 0) { > + pr_err("loop0 free extent enter sleeping 5 hours\n"); > + msleep(3600*1000*5); > + pr_err("loop0 free extent end sleeping 5 hours\n"); > + } > xefi = container_of(item, struct xfs_extent_free_item, xefi_list); > > error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi); > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c > index 7d4109af193e..b571fe6966af 100644 > --- a/fs/xfs/xfs_trans_ail.c > +++ b/fs/xfs/xfs_trans_ail.c > @@ -557,6 +557,13 @@ xfsaild_push( > xfs_trans_ail_cursor_done(&cur); > spin_unlock(&ailp->ail_lock); > > + if (strcmp(ailp->ail_log->l_mp->m_super->s_id,"loop0") == 0) { > + pr_err("loop0 aild enter sleeping 5 hours\n"); > + msleep(3600*1000*5); // sleep 5 hours > + pr_err("loop0 aild end sleeping 5 hours\n"); > + pr_err("loop0 start issuing buffer IO\n"); > + } > + > if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list)) > ailp->ail_log_flush++; > > The hacking patch does two things: > 1. sleeps 5 hours in the ext freeing deferred op before reserving log space. No it doesn't. 
The delay you added above occurs *after* the transaction has already guaranteed it has full log reservation for the extent freeing operation. You even use the transaction pointer to get the mount id (loop0) before stalling it.... > That simulates very slow extent free processing > 2. sleeps 5 hours before flushing metadata buffers. That simulate very slow > AIL processing. Well, it actually simulates *completely stalled* AIL processing. If it was simulating slow processing, then the tail of the log would always still be moving forwards slowly. Stalling metadata writeback for 5 hours is effectively hanging the journal for 5 hours, not making it "go slow". Those are very different situations that result in very different behaviours. > Though the hacking patch sleeps 5 hours, it actually needs only 10 minutes > to reproduce this issue. You should be able to fully stall the log and filesystem in under a second with this patch. > Reproduce steps: > > 1. create a XFS with 10MiB log size (small so easier to reproduce). The following > steps all aim at this XFS volume. Actually, make that a few milliseconds.... :) mkfs/xfs_info output would be appreciated. > 2. with un-hacked kernel, create 100K files, 1 block each file, under a dedicated > directory. We will overwrite those files causing inode timestamp logging. Ok, that's step 4, right? You're not doing that here? > A timestamp logging (tr_fsyncts) requires/resevers 2508 bytes, which is less > than the required size for 360416 (tr_itrunc) for 4.14.35 kernel. Not able to get > the exact number for 6.4.0, but should be similar. With those timestamp updating, > we are able to make the free log size less than 360416. > > 3. With hacked kernel, create a new file, allocating block(s) to it and remove it. > Because of the hacking patch, EFI would be logged but the EFD will not in 5 hours. > We will panic the kernel in 5 hours so that with next mount, EFI recover would > happen. 
Ok, so one EFI at the tail of the log, with a full reserve grant head and write grant head reservation for the extent free transaction that has been blocked. > 4. with another console, serially overwrite the 100K files we created in step 2. > This gets stuck after overwriting 57K+ files. That's because we used up (almost) > all the free log space and we pinned the log tail position with the hacking patch. > Checking the vmcore, it is stuck waiting for 2508 log bytes (1552 free bytes available). > The numbers are exactly the same with the 4.14.35 kernel and the 6.4.0 kernel. Yup, and the reserve grant head should also include at least one full tr_itrunc reservation for the EFI in progress. > > 5. Checking the on-disk free log space left, it's 181760 bytes for both the 4.14.35 > kernel and the 6.4.0 kernel. Which is clearly wrong. It should be at least 360416 bytes (i.e. tr_itrunc), because that's what the EFI being processed that pins the tail of the log is supposed to have reserved when it was stalled. So where has the ~180kB of leaked space come from? Have you traced the grant head reservations to find out what the runtime log space and grant head reservations actually are? i.e. we have full tracing of the log reservation accounting via tracepoints in the kernel. If there is a leak occurring, you need to capture a trace of all the reservation accounting operations and post-process the output to find out what operation is leaking reserved space. e.g. # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo ....
# trace-cmd report > s.t # head -3 s.t cpus=16 touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 # So this tells us the transaction reservation unit size, the count of reservations, the current reserve and grant head locations, and the current head and tail of the log at the time the transaction reservation is started and then after it completes. Hence we can tell if the reservation moved the grant head correctly, whether there was space available prior to the reservation being made, whether the grant heads contain in-memory reservations or whether they are empty (i.e. match the log tail), etc. All cases where we change the reservation and the grant heads are also captured by the above trace-cmd event filter, so a full picture of the grant head reservations vs log head/tail can be determined from post-processing the events.... This is how I've found log space leaks in the past. It's entirely possible that we are leaking a couple of bytes somewhere in the accounting, and that's the cause of the issue. It's happened before, and the accounting is complex enough that it will most likely happen again... > It's not the same as the free log space as seen in the vmcore, > 1552 bytes. I am not able to figure out why they are different yet. Does the delta aim > to make log recovery safe? Because different runtime conditions lead to different free log space states.
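Those two sample events already contain everything needed to sanity-check the grant-head move. Below is a minimal sketch of that check; the `fields()` helper and its regex are ours, keyed to the field names in the trace lines quoted above, and the exact event field layout varies between kernel versions:

```python
import re

# The two sample events quoted above, minus the task/timestamp prefix.
enter = ("xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 "
         "t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty "
         "writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 "
         "grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 "
         "curr_block 2 tail_cycle 1 tail_block 2")
exit_ = (enter.replace("xfs_log_reserve:", "xfs_log_reserve_exit:")
              .replace("grant_reserve_bytes 1024", "grant_reserve_bytes 482800")
              .replace("grant_write_bytes 1024", "grant_write_bytes 482800"))

def fields(line):
    """Pull the numeric key/value pairs out of one trace event line."""
    return {k: int(v) for k, v in re.findall(r"(\w+) (\d+)", line)}

e, x = fields(enter), fields(exit_)
# A permanent reservation should move the reserve grant head by exactly
# t_cnt * t_unit_res bytes between the enter and exit events.
moved = x["grant_reserve_bytes"] - e["grant_reserve_bytes"]
print(moved, moved == x["t_cnt"] * x["t_unit_res"])  # 481776 True (2 * 240888)
```

A leak hunt then amounts to pairing every reservation with the events that eventually give its space back, by ticket, and checking that bytes taken equal bytes returned.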
panic the kernel manually and reboot to un-hacked kernel. Verify the on disk > free log space, it's the same as seen in step 5. Right, so something has gone wrong with the runtime accounting, and log recovery is just the messenger. We need to find out what went wrong at runtime, not hack about in recovery trying to handle a situation that should not be happening in the first place... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-24 0:57 ` Dave Chinner @ 2023-07-24 18:03 ` Wengang Wang 2023-07-26 4:08 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Wengang Wang @ 2023-07-24 18:03 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S > On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: >> FYI: >> >> I am able reproduce the XFS mount hang issue with hacked kernels based on >> both 4.14.35 kernel or 6.4.0 kernel. >> >> The hacking patch is like this (based on 6.4.0 kernel): >> >> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c >> index f9e36b810663..e04e8738c564 100644 >> --- a/fs/xfs/xfs_extfree_item.c >> +++ b/fs/xfs/xfs_extfree_item.c >> @@ -491,6 +491,12 @@ xfs_extent_free_finish_item( >> struct xfs_extent_free_item *xefi; >> int error; >> >> + if (strcmp(tp->t_mountp->m_super->s_id,"loop0") == 0) { >> + pr_err("loop0 free extent enter sleeping 5 hours\n"); >> + msleep(3600*1000*5); >> + pr_err("loop0 free extent end sleeping 5 hours\n"); >> + } >> xefi = container_of(item, struct xfs_extent_free_item, xefi_list); >> >> error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi); >> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c >> index 7d4109af193e..b571fe6966af 100644 >> --- a/fs/xfs/xfs_trans_ail.c >> +++ b/fs/xfs/xfs_trans_ail.c >> @@ -557,6 +557,13 @@ xfsaild_push( >> xfs_trans_ail_cursor_done(&cur); >> spin_unlock(&ailp->ail_lock); >> >> + if (strcmp(ailp->ail_log->l_mp->m_super->s_id,"loop0") == 0) { >> + pr_err("loop0 aild enter sleeping 5 hours\n"); >> + msleep(3600*1000*5); // sleep 5 hours >> + pr_err("loop0 aild end sleeping 5 hours\n"); >> + pr_err("loop0 start issuing buffer IO\n"); >> + } >> + >> if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list)) >> ailp->ail_log_flush++; >> >> The hacking patch does two things: >> 1. 
sleeps 5 hours in the ext freeing deferred op before reserving log space. > > No it doesn't. The delay you added above occurs *after* the > transaction has already guaranteed it has full log reservation for > the extent freeing operation. You even use the transaction pointer > to get the mount id (loop0) before stalling it.... Right, I wanted to do so but made the wrong patch. I should do it during the transaction roll, after committing the original transaction and before reserving new resources. > >> That simulates very slow extent free processing > >> 2. sleeps 5 hours before flushing metadata buffers. That simulates > very slow >> AIL processing. > > Well, it actually simulates *completely stalled* AIL processing. > If it was simulating slow processing, then the tail of the log would > always still be moving forwards slowly. Stalling metadata writeback > for 5 hours is effectively hanging the journal for 5 hours, not > making it "go slow". Those are very different situations that result > in very different behaviours. > >> Though the hacking patch sleeps 5 hours, it actually needs only 10 minutes >> to reproduce this issue. > > You should be able to fully stall the log and filesystem in under a > second with this patch. > The script allocates a transaction (reserving resources), commits it, and then starts another round. I didn't want pending transactions (with resources reserved but not yet committed) to exist when the free log bytes were exhausted. If there are enough pending transactions in memory, the on-disk free log bytes could be enough for EFI recovery. So the script didn't use up (almost) all the free log bytes that quickly. The script can be improved, but it already demonstrates the problem. >> Reproduce steps: >> >> 1. create an XFS with a 10MiB log size (small, so easier to reproduce). The following >> steps all aim at this XFS volume. > > Actually, make that a few milliseconds.... :) > :) > mkfs/xfs_info output would be appreciated.
sure,

# xfs_info 20GB.bk2
meta-data=20GB.bk2       isize=256    agcount=4, agsize=1310720 blks
         =               sectsz=512   attr=2, projid32bit=1
         =               crc=0        finobt=0, sparse=0, rmapbt=0
         =               reflink=0
data     =               bsize=4096   blocks=5242880, imaxpct=25
         =               sunit=0      swidth=0 blks
naming   =version 2      bsize=4096   ascii-ci=0, ftype=1
log      =internal log   bsize=4096   blocks=2560, version=2
         =               sectsz=512   sunit=0 blks, lazy-count=1
realtime =none           extsz=4096   blocks=0, rtextents=0

> >> 2. with an un-hacked kernel, create 100K files, 1 block each file, under a dedicated >> directory. > We will overwrite those files, causing inode timestamp logging. > > Ok, that's step 4, right? You're not doing that here? Yes, that's step 4. > >> A timestamp logging transaction (tr_fsyncts) requires/reserves 2508 bytes, which is less >> than the > required size of 360416 bytes (tr_itrunc) for the 4.14.35 kernel. Not able to get >> the exact number for 6.4.0, but it should be similar. With those timestamp updates, >> we are able to make the free log size less than > 360416. >> >> 3. With the hacked kernel, create a new file, allocate block(s) to it, and remove it. >> Because of the hacking patch, the EFI would be logged but the EFD will not be for 5 hours. >> We will panic the kernel within 5 hours so that on the next mount, EFI recovery would >> happen. > > Ok, so one EFI at the tail of the log, with a full reserve grant > head and write grant head reservation for the extent free > transaction that has been blocked. Yes, with the current patch (though I didn't want the reservation). > >> 4. with another console, serially overwrite the 100K files we created in step 2. >> This gets stuck after overwriting 57K+ files. That's because we used up (almost) >> all the free log space and we pinned the log tail position with the hacking patch. >> > Checking the vmcore, it is stuck waiting for 2508 log bytes (1552 free bytes available). >> The numbers are exactly the same with the 4.14.35 kernel and the 6.4.0 kernel.
> > Yup, and the reserve grant head should also include at least one > full tr_itrunc reservation for the EFI in progress. Yes (though I didn’t want that tr_itrunc reservation). > >> >> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 >> kernel and 6.4.0 kernel. > > Which is is clearly wrong. It should be at least 360416 bytes (i.e > tr_itrunc), because that's what the EFI being processed that pins > the tail of the log is supposed to have reserved when it was > stalled. Yep, exactly. > So where has the ~180kB of leaked space come from? > > Have you traced the grant head reservations to find out > what the runtime log space and grant head reservations actually are? I have the numbers in vmcore (ignore the WARNs), 74 l_reserve_head 75 head position: cycle: 8 bytes: 2978800 76 tail position: cycle: 7 bytes: 2980352 77 lastSync pos : cycle: 8 bytes: 2797568 78 available_size with head: 1552 bytes 79 1 log space reservers waiting 80 DEBUG: *(struct xlog_ticket *)0xffff9ba296dc68a0 = { 81 .t_queue = (struct list_head){ 82 .next = (struct list_head *)0xffff9ba2cfaabd88, 83 .prev = (struct list_head *)0xffff9ba2cfaabd88, 84 }, 85 .t_task = (struct task_struct *)0xffff9ba280fc4180, 86 .t_tid = (xlog_tid_t)1461406111, 87 .t_ref = (atomic_t){ 88 .counter = (int)1, 89 }, 90 .t_curr_res = (int)2508, 91 .t_unit_res = (int)2508, 92 .t_ocnt = (char)0, 93 .t_cnt = (char)0, 94 .t_flags = (uint16_t)0, 95 .t_iclog_hdrs = (int)1, 96 } 97 process 126306 needs: 2508 bytes 98 previous waiters need: 2508 bytes 99 WARN: Pending transactions not checked (to save time) 100 0 log bytes reserved for active transactions (this FS) 101 Reserved and to be reserved less than log size, seems xlog.l_tail_lsn not updated some how 102 l_write_head 103 head position: cycle: 8 bytes: 2978800 104 tail position: cycle: 7 bytes: 2980352 105 lastSync pos : cycle: 8 bytes: 2797568 106 available_size with head: 1552 bytes 107 0 log space reservers waiting 108 WARN: Pending 
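The 1552-byte figure in the dump falls straight out of the head/tail positions. The sketch below shows the arithmetic; it is a simplification of the kernel's cycle/block grant-head encoding, treating each position as a (cycle, bytes-into-cycle) pair as printed in the dump, with the log size taken from the xfs_info output earlier in the thread (2560 blocks x 4096 bytes):

```python
LOG_SIZE = 2560 * 4096  # 10 MiB internal log, per the xfs_info output above

def log_space_free(head, tail, size=LOG_SIZE):
    """Free log space given head/tail as (cycle, bytes-into-cycle) pairs.

    The log is a ring of `size` bytes and the head is at most one cycle
    ahead of the tail, so in absolute byte terms the free space is
    tail + size - head.
    """
    head_abs = head[0] * size + head[1]
    tail_abs = tail[0] * size + tail[1]
    return tail_abs + size - head_abs

# Positions from the vmcore dump above: head cycle 8 / 2978800 bytes,
# tail cycle 7 / 2980352 bytes.
print(log_space_free(head=(8, 2978800), tail=(7, 2980352)))  # 1552
```

This matches the "available_size with head: 1552 bytes" line in the dump, which is why the 2508-byte tr_fsyncts reservation cannot be granted.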
transactions not checked (to save time) 109 0 log bytes reserved for active transactions (this FS) > > i.e. we have full tracing of the log reservation accounting via > tracepoints in the kernel. If there is a leak occurring, you need to > capture a trace of all the reservation accounting operations and > post process the output to find out what operation is leaking > reserved space. e.g. > > # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo > .... > # trace-cmd report > s.t > # head -3 s.t > cpus=16 > touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > # > > So this tells us the transaction reservation unit size, the count of > reservations, the current reserve and grant head locations, and the > current head and tail of the log at the time the transaction > reservation is started and then after it completes. Will do that and report back. You want full log or only some typical ones? Full log would be big, how shall I share? > > Hence we can tell if the reservation moved the grant head correctly, > whether there was space available prior to the reservation being > made, whether the grant heads contain in-memory reservations or > whether they are empty (i.e. match the log tail), etc. 
> > All cases where we change the reservation and the grant heads are > also captured by the above trace-cmd event filter, so a full picture > of the grant head reservations vs log head/tail can be determined > from post-processing the events.... > > This is how I've found log space leaks in the past. It's entirely > possible that we are leaking a couple of bytes somewhere in the > accounting, and that's the cause of the issue. It's happened before, > and the accounting is complex enough it will most likely happen > again... > >> It’s not same as the free log space as seen in vmcore, >> 1552 bytes. > I am not able to figure oute why they are different yet. The delta aims >> to make log recover safe? > > Because different runtime conditions lead to different free log space > state. > >> 6. panic the kernel manually and reboot to un-hacked kernel. Verify the on disk >> free log space, it's the same as seen in step 5. > > Right, so something has gone wrong with the runtime accounting, and > log recovery is just the messenger. We need to find out what went > wrong at runtime, not hack about in recovery trying to handle a > situation that should not be happening in the first place... > Yes, agreed. (I was/am not chasing to fix it in recovery). thanks, wengang ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-24 18:03 ` Wengang Wang @ 2023-07-26 4:08 ` Dave Chinner 2023-07-26 15:23 ` Darrick J. Wong 2023-07-28 17:56 ` Wengang Wang 0 siblings, 2 replies; 22+ messages in thread From: Dave Chinner @ 2023-07-26 4:08 UTC (permalink / raw) To: Wengang Wang; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: > > On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: > >> FYI: > >> > >> I am able reproduce the XFS mount hang issue with hacked kernels based on > >> both 4.14.35 kernel or 6.4.0 kernel. > >> Reproduce steps: > >> > >> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following > >> steps all aim at this XFS volume. > > > > Actually, make that a few milliseconds.... :) > > :) > > > mkfs/xfs_info output would be appreciated. > > sure, > # xfs_info 20GB.bk2 > meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks > = sectsz=512 attr=2, projid32bit=1 > = crc=0 finobt=0, sparse=0, rmapbt=0 > = reflink=0 Hmmmm. Why are you only testing v4 filesystems? They are deprecated and support is largely due to be dropped from upstream in 2025... Does the same problem occur with a v5 filesystems? > >> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 > >> kernel and 6.4.0 kernel. > > > > Which is is clearly wrong. It should be at least 360416 bytes (i.e > > tr_itrunc), because that's what the EFI being processed that pins > > the tail of the log is supposed to have reserved when it was > > stalled. > > Yep, exactly. > > > So where has the ~180kB of leaked space come from? > > > > Have you traced the grant head reservations to find out > > what the runtime log space and grant head reservations actually are? > I have the numbers in vmcore (ignore the WARNs), That's not what I'm asking. 
You've dumped the values at the time of the hang, not traced the runtime reservations that have been made. > > i.e. we have full tracing of the log reservation accounting via > > tracepoints in the kernel. If there is a leak occurring, you need to > > capture a trace of all the reservation accounting operations and > > post process the output to find out what operation is leaking > > reserved space. e.g. > > > > # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo > > .... > > # trace-cmd report > s.t > > # head -3 s.t > > cpus=16 > > touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > > > # > > > > So this tells us the transaction reservation unit size, the count of > > reservations, the current reserve and grant head locations, and the > > current head and tail of the log at the time the transaction > > reservation is started and then after it completes. > > Will do that and report back. You want full log or only some typical > ones? Full log would be big, how shall I share? I don't want to see the log. It'll be huge - I regularly generate traces containing gigabytes of log accounting traces like this from a single workload. What I'm asking you to do is run the tracing and then post process the values from the trace to determine what operation is using more space than is being freed back to the log. I generally do this with grep, awk and sed. 
Some people use python or perl. But either way it's a *lot* of work - in the past I have spent _weeks_ on trace analysis to find a 4 byte leak in the log space accounting. Doing things like graphing the head, tail and grant spaces over time tends to show if this is a gradual leak versus a sudden step change. If it's a sudden step change, then you can isolate it in the trace and work out what happened. If it's a gradual change, then you need to start looking for accounting discrepancies... e.g. a transaction records 32 bytes used in the item, so it releases t_unit - 32 bytes at commit. However, the CIL may then only track 28 bytes of space for the item in the journal, and we leak 4 bytes of reservation on every one of those items committed. These sorts of leaks typically only add up to being something significant in situations where the log is flooded with tiny inode timestamp changes - 4 bytes per item doesn't really matter when you only have a few thousand items in the log, but when you have hundreds of thousands of tiny items in the log... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
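The gradual-versus-step distinction described above is easy to automate once a free-space series has been extracted from the trace. A toy sketch, using synthetic sample data in place of the per-event series a real trace would yield:

```python
def classify_drift(samples, step_threshold):
    """Given free-log-space samples over time, report the largest single
    drop, the net change, and whether the loss looks like a sudden step
    or a gradual leak."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    worst = min(deltas)                 # most negative single-event change
    total = samples[-1] - samples[0]    # net drift over the whole series
    kind = "step" if worst <= -step_threshold else "gradual"
    return worst, total, kind

# Synthetic example: a slow 4-byte-per-item leak...
gradual = [100000 - 4 * i for i in range(200)]
print(classify_drift(gradual, step_threshold=1000))  # (-4, -796, 'gradual')

# ...versus one sudden 50000-byte step change.
step = [100000] * 100 + [50000] * 100
print(classify_drift(step, step_threshold=1000))     # (-50000, -50000, 'step')
```

A "step" result points at one specific event window to inspect in the trace; a "gradual" result points at per-item accounting discrepancies like the t_unit/CIL mismatch described above.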
* Re: Question: reserve log space at IO time for recover 2023-07-26 4:08 ` Dave Chinner @ 2023-07-26 15:23 ` Darrick J. Wong 2023-07-27 1:05 ` Dave Chinner 2023-07-28 17:56 ` Wengang Wang 1 sibling, 1 reply; 22+ messages in thread From: Darrick J. Wong @ 2023-07-26 15:23 UTC (permalink / raw) To: Dave Chinner; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Wed, Jul 26, 2023 at 02:08:05PM +1000, Dave Chinner wrote: > On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: > > > On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: > > > On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: > > >> FYI: > > >> > > >> I am able reproduce the XFS mount hang issue with hacked kernels based on > > >> both 4.14.35 kernel or 6.4.0 kernel. > > >> Reproduce steps: > > >> > > >> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following > > >> steps all aim at this XFS volume. > > > > > > Actually, make that a few milliseconds.... :) > > > > :) > > > > > mkfs/xfs_info output would be appreciated. > > > > sure, > > # xfs_info 20GB.bk2 > > meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks > > = sectsz=512 attr=2, projid32bit=1 > > = crc=0 finobt=0, sparse=0, rmapbt=0 > > = reflink=0 > > Hmmmm. Why are you only testing v4 filesystems? They are deprecated > and support is largely due to be dropped from upstream in 2025... > > Does the same problem occur with a v5 filesystems? > > > >> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 > > >> kernel and 6.4.0 kernel. > > > > > > Which is is clearly wrong. It should be at least 360416 bytes (i.e > > > tr_itrunc), because that's what the EFI being processed that pins > > > the tail of the log is supposed to have reserved when it was > > > stalled. > > > > Yep, exactly. > > > > > So where has the ~180kB of leaked space come from? 
> > > > > > Have you traced the grant head reservations to find out > > > what the runtime log space and grant head reservations actually are? > > I have the numbers in vmcore (ignore the WARNs), > > That's not what I'm asking. You've dumped the values at the time of > the hang, not traced the runtime reservations that have been made. > > > > i.e. we have full tracing of the log reservation accounting via > > > tracepoints in the kernel. If there is a leak occurring, you need to > > > capture a trace of all the reservation accounting operations and > > > post process the output to find out what operation is leaking > > > reserved space. e.g. > > > > > > # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo > > > .... > > > # trace-cmd report > s.t > > > # head -3 s.t > > > cpus=16 > > > touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > > touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > > > > > # > > > > > > So this tells us the transaction reservation unit size, the count of > > > reservations, the current reserve and grant head locations, and the > > > current head and tail of the log at the time the transaction > > > reservation is started and then after it completes. > > > > Will do that and report back. You want full log or only some typical > > ones? Full log would be big, how shall I share? > > I don't want to see the log. 
It'll be huge - I regularly generate > traces containing gigabytes of log accounting traces like this from > a single workload. > > What I'm asking you to do is run the tracing and then post process > the values from the trace to determine what operation is using more > space than is being freed back to the log. > > I generally do this with grep, awk and sed. some people use python > or perl. But either way it's a *lot* of work - in the past I have > spent _weeks_ on trace analysis to find a 4 byte leak in the log > space accounting. DOing things like graphing the head, tail and grant > spaces over time tend to show if this is a gradual leak versus a > sudden step change. If it's a sudden step change, then you can > isolate it in the trace and work out what happened. If it's a > gradual change, then you need to start looking for accounting > discrepancies... Any chance you'd be willing to share that pipeline? It'd be useful to stash that kind of debugging program in xfsprogs/tools to save time and eliminate an entire class of "Hey Dave, did I capture this correctly?" questions. (At least until someone changes the tracepoints :P) --D > e.g. a transaction records 32 bytes used in the item, so it releases > t_unit - 32 bytes at commit. However, the CIL may then only track 28 > bytes of space for the item in the journal and we leak 4 bytes of > reservation on every on of those items committed. > > These sorts of leaks typically only add up to being somethign > significant in situations where the log is flooded with tiny inode > timestamp changes - 4 bytes iper item doesn't really matter when you > only have a few thousand items in the log, but when you have > hundreds of thousands of tiny items in the log... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-26 15:23 ` Darrick J. Wong @ 2023-07-27 1:05 ` Dave Chinner 0 siblings, 0 replies; 22+ messages in thread From: Dave Chinner @ 2023-07-27 1:05 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Wed, Jul 26, 2023 at 08:23:20AM -0700, Darrick J. Wong wrote: > On Wed, Jul 26, 2023 at 02:08:05PM +1000, Dave Chinner wrote: > > On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: > > > > On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: > > > > On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: > > > > i.e. we have full tracing of the log reservation accounting via > > > > tracepoints in the kernel. If there is a leak occurring, you need to > > > > capture a trace of all the reservation accounting operations and > > > > post process the output to find out what operation is leaking > > > > reserved space. e.g. > > > > > > > > # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo > > > > .... 
> > > > # trace-cmd report > s.t > > > > # head -3 s.t > > > > cpus=16 > > > > touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > > > touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > > > > > > > # > > > > > > > > So this tells us the transaction reservation unit size, the count of > > > > reservations, the current reserve and grant head locations, and the > > > > current head and tail of the log at the time the transaction > > > > reservation is started and then after it completes. > > > > > > Will do that and report back. You want full log or only some typical > > > ones? Full log would be big, how shall I share? > > > > I don't want to see the log. It'll be huge - I regularly generate > > traces containing gigabytes of log accounting traces like this from > > a single workload. > > > > What I'm asking you to do is run the tracing and then post process > > the values from the trace to determine what operation is using more > > space than is being freed back to the log. > > > > I generally do this with grep, awk and sed. some people use python > > or perl. But either way it's a *lot* of work - in the past I have > > spent _weeks_ on trace analysis to find a 4 byte leak in the log > > space accounting. DOing things like graphing the head, tail and grant > > spaces over time tend to show if this is a gradual leak versus a > > sudden step change. 
If it's a sudden step change, then you can > > isolate it in the trace and work out what happened. If it's a > > gradual change, then you need to start looking for accounting > > discrepancies... > > Any chance you'd be willing to share that pipeline? It's all just one-off scripts I tend to write from scratch each time. I'm always looking for different things, and I never know what I'm looking for ahead of time, so the parser is always different.... > It'd be useful to > stash that kind of debugging program in xfsprogs/tools to save time and > eliminate an entire class of "Hey Dave, did I capture this correctly?" > questions. > > (At least until someone changes the tracepoints :P) That happens frequently enough... :/ Regardless, it's the process of writing filter scripts that helps me find the problem I'm looking for, not the actual filter script itself. e.g. I've been working on a similar "EFI reservation deadlocks on log space during recovery" issue over the past couple of days from a customer metadump. The filesystem had a 2GB log, and it was full. Log recovery took several minutes to run, there were so many items to recover. The (incomplete) logprint of all the op headers was over a million lines (somewhere around 400k individual log items), and I know that logprint output skipped at least two checkpoints that had over 65,000 ophdrs in them each. So log recovery was recovering roughly half a million log items from the log, and only then it was hanging on the first EFI that needed to be recovered. The mount trace was somewhere around 20GB of event data at the point of the hang; the logprint output indicated about 1.3 million ophdrs in the log that were being processed by recovery. I worked that out to be around half a million individual log items in the log that needed recovery.
In setting up a regex to filter the trace down to something manageable, I noticed that the id of the EFI that hung recovery matched an EFI in the very first checkpoint being recovered from the log (from the logprint output). Logprint told me that this checkpoint was almost entirely EFI/EFDs from an unlinked inode being inactivated. The last two log items in that checkpoint were the inode item and a single EFI. A quick grep failed to find that inode anywhere else in the logprint, and the EFI id didn't show up in any EFD item, either. Pulling them out of the mount trace confirmed that they only appeared in recovery of the initial checkpoint. IOWs, the EFI - and the inode that was locked across the processing of the EFI at runtime - had pinned the tail of the log, and that's why recovery hung waiting on log space. It turns out that the hang in log recovery followed a system reboot due to a hung task timeout. So right now, it looks to me like the runtime code had deadlocked the journal because transaction reservation during a roll couldn't make progress, because the inode held locked across the transaction roll pinned the tail of the log.... How that happened, I have no idea, but it was the _process_ of writing filters to analyse the logprint data and log reservation traces that allowed me to find this, not because I had a set of canned scripts that exposed that explicit information. Even if I write a canned script to find unmatched EFIs in a logprint, the logprint output is sufficiently unreliable that I couldn't rely on the output of the filter script anyway..... Fundamentally, I don't have the time to run this trace+filter process for every report of similar problems - it takes a long time just to work out what I need to look for in the traces. Hence it's important that more people learn the process - how to read and filter the traces to isolate something that might be going wrong.
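The EFI/EFD pairing logic itself is simple set arithmetic; it is the unreliable logprint output feeding it that makes a canned version untrustworthy. A sketch of the idea, where the EFI/EFD line format is invented purely for illustration and does not match any particular xfs_logprint version:

```python
import re

# Hypothetical logprint-style lines: two EFIs, only one of which is
# completed by a matching EFD.
logprint = """\
EFI: cnt:1 total:1 c:1 id:0xffff8800deadbeef
EFD: cnt:1 id:0xffff8800deadbeef
EFI: cnt:1 total:1 c:1 id:0xffff8800cafef00d
INO: ...
"""

def unmatched_efis(text):
    """Return EFI ids that never see a matching EFD - candidates for the
    intent that pinned the tail of the log."""
    efis = set(re.findall(r"EFI:.*?id:(0x[0-9a-f]+)", text))
    efds = set(re.findall(r"EFD:.*?id:(0x[0-9a-f]+)", text))
    return efis - efds

print(unmatched_efis(logprint))  # {'0xffff8800cafef00d'}
```

The same pairing can be run over the mount-trace intent-recovery events instead of logprint output, which avoids the logprint reliability problem at the cost of needing a live reproduction.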
I don't have canned scripts because the scripts I write are guided by what I think I need to search for, and I regularly get to a dead end and have to start the filtering process all over again to look for something else entirely. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
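[Editor's sketch] A minimal example of the kind of one-off filter Dave describes - collecting EFI ids that never show up in a matching EFD - might look like the following in python. The `id 0x...` line format here is an assumption for illustration only; real xfs_logprint output varies between versions and, as Dave notes above, can be unreliable, so this is a sketch of the process rather than a reusable tool:

```python
import re

# Assumed (hypothetical) item format: each EFI/EFD line carries "id 0x<hex>".
_EFI = re.compile(r"\bEFI\b.*?\bid\s+(0x[0-9a-fA-F]+)")
_EFD = re.compile(r"\bEFD\b.*?\bid\s+(0x[0-9a-fA-F]+)")

def unmatched_efis(logprint_lines):
    """Return the set of EFI ids that have no matching EFD anywhere."""
    efis, efds = set(), set()
    for line in logprint_lines:
        if (m := _EFI.search(line)):
            efis.add(m.group(1))
        if (m := _EFD.search(line)):
            efds.add(m.group(1))
    return efis - efds
```

An id left in the result set is a candidate for an intent pinning the tail of the log, which is exactly what the initial-checkpoint EFI above turned out to be.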
* Re: Question: reserve log space at IO time for recover 2023-07-26 4:08 ` Dave Chinner 2023-07-26 15:23 ` Darrick J. Wong @ 2023-07-28 17:56 ` Wengang Wang 2023-08-18 3:25 ` Wengang Wang 1 sibling, 1 reply; 22+ messages in thread From: Wengang Wang @ 2023-07-28 17:56 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S > On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: >>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: >>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: >>>> FYI: >>>> >>>> I am able reproduce the XFS mount hang issue with hacked kernels based on >>>> both 4.14.35 kernel or 6.4.0 kernel. >>>> Reproduce steps: >>>> >>>> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following >>>> steps all aim at this XFS volume. >>> >>> Actually, make that a few milliseconds.... :) >> >> :) >> >>> mkfs/xfs_info output would be appreciated. >> >> sure, >> # xfs_info 20GB.bk2 >> meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks >> = sectsz=512 attr=2, projid32bit=1 >> = crc=0 finobt=0, sparse=0, rmapbt=0 >> = reflink=0 > > Hmmmm. Why are you only testing v4 filesystems? They are deprecated > and support is largely due to be dropped from upstream in 2025... > Ha, it just happened to be so. I was trying to reproduce it in the same environment as customer has — that’s OracleLinux7. The default behavior of mkfs.xfs in OL7 is to format v4 filesystems. I created the xfs image in a file on OL7 and copied the image file to a 6.4.0 kernel machine. That’s why you see v4 filesystem here. > Does the same problem occur with a v5 filesystems? Will test and report back. > >>>> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 >>>> kernel and 6.4.0 kernel. >>> >>> Which is is clearly wrong. 
It should be at least 360416 bytes (i.e >>> tr_itrunc), because that's what the EFI being processed that pins >>> the tail of the log is supposed to have reserved when it was >>> stalled. >> >> Yep, exactly. >> >>> So where has the ~180kB of leaked space come from? >>> >>> Have you traced the grant head reservations to find out >>> what the runtime log space and grant head reservations actually are? >> I have the numbers in vmcore (ignore the WARNs), > > That's not what I'm asking. You've dumped the values at the time of > the hang, not traced the runtime reservations that have been made. > >>> i.e. we have full tracing of the log reservation accounting via >>> tracepoints in the kernel. If there is a leak occurring, you need to >>> capture a trace of all the reservation accounting operations and >>> post process the output to find out what operation is leaking >>> reserved space. e.g. >>> >>> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo >>> .... >>> # trace-cmd report > s.t >>> # head -3 s.t >>> cpus=16 >>> touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>> touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>> >>> # >>> >>> So this tells us the transaction reservation unit size, the count of >>> reservations, the current reserve and grant head locations, and the >>> current head and tail of the log at the time the transaction >>> reservation is started and then after it completes. 
>> >> Will do that and report back. You want full log or only some typical >> ones? Full log would be big, how shall I share? > > I don't want to see the log. It'll be huge - I regularly generate > traces containing gigabytes of log accounting traces like this from > a single workload. > > What I'm asking you to do is run the tracing and then post process > the values from the trace to determine what operation is using more > space than is being freed back to the log. > > I generally do this with grep, awk and sed. some people use python > or perl. But either way it's a *lot* of work - in the past I have > spent _weeks_ on trace analysis to find a 4 byte leak in the log > space accounting. DOing things like graphing the head, tail and grant > spaces over time tend to show if this is a gradual leak versus a > sudden step change. If it's a sudden step change, then you can > isolate it in the trace and work out what happened. If it's a > gradual change, then you need to start looking for accounting > discrepancies... > > e.g. a transaction records 32 bytes used in the item, so it releases > t_unit - 32 bytes at commit. However, the CIL may then only track 28 > bytes of space for the item in the journal and we leak 4 bytes of > reservation on every on of those items committed. > > These sorts of leaks typically only add up to being somethign > significant in situations where the log is flooded with tiny inode > timestamp changes - 4 bytes iper item doesn't really matter when you > only have a few thousand items in the log, but when you have > hundreds of thousands of tiny items in the log... > OK. will work more on this. # I am going to start a two-week vacation, and will then continue on this when back. thanks, wengang ^ permalink raw reply [flat|nested] 22+ messages in thread
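[Editor's sketch] The post-processing Dave suggests can be started in python against the xfs_log_reserve/xfs_log_reserve_exit line format shown above. This is a rough sketch: it matches "name value" fields naively and assumes both samples sit in the same log cycle (no grant-head wrap):

```python
import re

# Numeric "name value" pairs in an xfs_log_reserve / xfs_log_reserve_exit
# trace-cmd report line, e.g. "t_cnt 2 ... grant_reserve_bytes 1024 ...".
_FIELDS = re.compile(r"(\w+) (\d+)\b")

def parse_trace_line(line):
    """Extract the numeric fields of one trace line into a dict."""
    return {name: int(val) for name, val in _FIELDS.findall(line)}

def reservation_delta(enter_line, exit_line):
    """Bytes the reserve grant head moved across one xfs_log_reserve call.
    Naive: ignores cycle wraps between the two samples."""
    enter = parse_trace_line(enter_line)
    exit_ = parse_trace_line(exit_line)
    return exit_["grant_reserve_bytes"] - enter["grant_reserve_bytes"]
```

For the two sample lines quoted above the delta is 481776 bytes, i.e. t_unit_res * t_cnt (240888 * 2). Summing such deltas per operation over a full trace, and comparing them against what commit/ungrant events hand back, is the sort of script that shows whether reservation space leaks gradually or in a single step.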
* Re: Question: reserve log space at IO time for recover 2023-07-28 17:56 ` Wengang Wang @ 2023-08-18 3:25 ` Wengang Wang 2023-08-21 22:06 ` Wengang Wang 2023-08-24 4:52 ` Darrick J. Wong 0 siblings, 2 replies; 22+ messages in thread From: Wengang Wang @ 2023-08-18 3:25 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S Here is what things look to be: For file deletion, log bytes are reserved based on xfs_mount->tr_itruncate, which is: { tr_logres = 175488, tr_logcount = 2, tr_logflags = XFS_TRANS_PERM_LOG_RES, } You see it’s a permanent log reservation with two log operations (two transactions in rolling mode). After calculation (xlog_calc_unit_res(), adding space for various log headers), the final log space needed per transaction changes from 175488 to 180208 bytes. So the total log space needed is 360416 (180208 * 2). The above log space (360416 bytes) needs to be reserved for both run time inode removal (xfs_inactive_truncate()) and EFI recovery (xfs_efi_item_recover()). RUNTIME ======== For run time inode removal, the first transaction is mainly used for inode field changes. The second transaction is used for intents including extent freeing. The first transaction has 180208 reserved log bytes (another 180208 bytes are reserved for the coming transaction). The first transaction is committed, writing some bytes to the log and releasing the remaining reserved bytes. Now we have the second transaction, which has 180208 log bytes reserved too. The second transaction is supposed to process intents including extent freeing. With my hacking patch, I blocked the extent freeing for 5 hours. So in those 5 hours, 180208 (NOT 360416) log bytes are reserved. With my test case, other transactions (updating timestamps) then happen. As my hacking patch pins the journal tail, those timestamp-updating transactions finally use up (almost) all the remaining available log space (in memory and on disk). 
And finally the on disk (and in memory) available log space goes down to near 180208 bytes. Those 180208 bytes are reserved by the above second (extent-free) transaction. Panic the kernel and remount the xfs volume. LOG RECOVER ============= With log recovery, during EFI recovery, we use tr_itruncate again to reserve two transactions that need 360416 log bytes. Reserving 360416 bytes fails (blocks) because we now only have about 180208 available. THINKING ======== Actually, during the EFI recovery, we only need one transaction to free the extents, just like the 2nd transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have (a bit) more than 180208 available log bytes on disk, so the reservation succeeds and the recovery proceeds. That is to say: we can fix the log recovery part to fix the issue. We can introduce a new xfs_trans_res xfs_mount->tr_ext_free { tr_logres = 175488, tr_logcount = 0, tr_logflags = 0, } and use tr_ext_free instead of tr_itruncate in EFI recovery. (didn’t try it). thanks, wengang > On Jul 28, 2023, at 10:56 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > > > >> On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote: >> >> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: >>>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: >>>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: >>>>> FYI: >>>>> >>>>> I am able reproduce the XFS mount hang issue with hacked kernels based on >>>>> both 4.14.35 kernel or 6.4.0 kernel. >>>>> Reproduce steps: >>>>> >>>>> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following >>>>> steps all aim at this XFS volume. >>>> >>>> Actually, make that a few milliseconds.... :) >>> >>> :) >>> >>>> mkfs/xfs_info output would be appreciated. 
>>> >>> sure, >>> # xfs_info 20GB.bk2 >>> meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks >>> = sectsz=512 attr=2, projid32bit=1 >>> = crc=0 finobt=0, sparse=0, rmapbt=0 >>> = reflink=0 >> >> Hmmmm. Why are you only testing v4 filesystems? They are deprecated >> and support is largely due to be dropped from upstream in 2025... >> > > Ha, it just happened to be so. > I was trying to reproduce it in the same environment as customer has — > that’s OracleLinux7. The default behavior of mkfs.xfs in OL7 is to format > v4 filesystems. I created the xfs image in a file on OL7 and copied the image > file to a 6.4.0 kernel machine. That’s why you see v4 filesystem here. > >> Does the same problem occur with a v5 filesystems? > > Will test and report back. > >> >>>>> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 >>>>> kernel and 6.4.0 kernel. >>>> >>>> Which is is clearly wrong. It should be at least 360416 bytes (i.e >>>> tr_itrunc), because that's what the EFI being processed that pins >>>> the tail of the log is supposed to have reserved when it was >>>> stalled. >>> >>> Yep, exactly. >>> >>>> So where has the ~180kB of leaked space come from? >>>> >>>> Have you traced the grant head reservations to find out >>>> what the runtime log space and grant head reservations actually are? >>> I have the numbers in vmcore (ignore the WARNs), >> >> That's not what I'm asking. You've dumped the values at the time of >> the hang, not traced the runtime reservations that have been made. >> >>>> i.e. we have full tracing of the log reservation accounting via >>>> tracepoints in the kernel. If there is a leak occurring, you need to >>>> capture a trace of all the reservation accounting operations and >>>> post process the output to find out what operation is leaking >>>> reserved space. e.g. >>>> >>>> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo >>>> .... 
>>>> # trace-cmd report > s.t >>>> # head -3 s.t >>>> cpus=16 >>>> touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>>> touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>>> >>>> # >>>> >>>> So this tells us the transaction reservation unit size, the count of >>>> reservations, the current reserve and grant head locations, and the >>>> current head and tail of the log at the time the transaction >>>> reservation is started and then after it completes. >>> >>> Will do that and report back. You want full log or only some typical >>> ones? Full log would be big, how shall I share? >> >> I don't want to see the log. It'll be huge - I regularly generate >> traces containing gigabytes of log accounting traces like this from >> a single workload. >> >> What I'm asking you to do is run the tracing and then post process >> the values from the trace to determine what operation is using more >> space than is being freed back to the log. >> >> I generally do this with grep, awk and sed. some people use python >> or perl. But either way it's a *lot* of work - in the past I have >> spent _weeks_ on trace analysis to find a 4 byte leak in the log >> space accounting. DOing things like graphing the head, tail and grant >> spaces over time tend to show if this is a gradual leak versus a >> sudden step change. If it's a sudden step change, then you can >> isolate it in the trace and work out what happened. 
If it's a >> gradual change, then you need to start looking for accounting >> discrepancies... >> >> e.g. a transaction records 32 bytes used in the item, so it releases >> t_unit - 32 bytes at commit. However, the CIL may then only track 28 >> bytes of space for the item in the journal and we leak 4 bytes of >> reservation on every on of those items committed. >> >> These sorts of leaks typically only add up to being somethign >> significant in situations where the log is flooded with tiny inode >> timestamp changes - 4 bytes iper item doesn't really matter when you >> only have a few thousand items in the log, but when you have >> hundreds of thousands of tiny items in the log... >> > > OK. will work more on this. > # I am going to start a two-week vacation, and will then continue on this when back. > > thanks, > wengang ^ permalink raw reply [flat|nested] 22+ messages in thread
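[Editor's sketch] The arithmetic in the RUNTIME/LOG RECOVER analysis above can be sanity-checked with a few lines of python. The 180208-byte per-transaction unit size is taken from this thread's numbers, not recomputed from xlog_calc_unit_res(), and treating a logcount of 0 as a single unit mirrors the proposed non-permanent tr_ext_free reservation:

```python
# Numbers from this thread (v4 filesystem, 10MiB log). UNIT_RES is
# tr_logres (175488) plus the header overhead that xlog_calc_unit_res()
# adds; the overhead itself is not derived here.
TR_LOGRES = 175488        # tr_itruncate.tr_logres
UNIT_RES = 180208         # per-transaction unit after xlog_calc_unit_res()
TR_LOGCOUNT = 2           # tr_itruncate.tr_logcount (permanent, rolling)
FREE_LOG_SPACE = 181760   # on-disk free journal space at mount time

def grant_space_needed(unit_res: int, logcount: int) -> int:
    """Grant space a reservation must find up front: one unit per count,
    treating logcount 0 (a plain, non-rolling transaction) as one unit."""
    return unit_res * max(logcount, 1)

# Stock EFI recovery reserves both counts of tr_itruncate and hangs:
print(grant_space_needed(UNIT_RES, TR_LOGCOUNT))  # 360416 > 181760 free
# A single-transaction extent-free reservation would fit:
print(grant_space_needed(UNIT_RES, 1))            # 180208 <= 181760 free
```

The gap between the two is exactly why recovery blocks while runtime only ever held one unit across the stalled extent-free step.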
* Re: Question: reserve log space at IO time for recover 2023-08-18 3:25 ` Wengang Wang @ 2023-08-21 22:06 ` Wengang Wang 2023-08-24 5:05 ` Darrick J. Wong 2023-08-24 4:52 ` Darrick J. Wong 1 sibling, 1 reply; 22+ messages in thread From: Wengang Wang @ 2023-08-21 22:06 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S > On Aug 17, 2023, at 8:25 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > > What things looks to be: > > For the file deletion. log bytes are reserved basing on xfs_mount->tr_itruncate which is: > { > tr_logres = 175488, > tr_logcount = 2, > tr_logflags = XFS_TRANS_PERM_LOG_RES, > } > You see it’s permanent log reservation with two log operations (two transactions in rolling mode). > After calculation (xlog_calc_unit_res(), adding space for various log headers), the final > log space needed per transaction changes from 175488 to 180208 bytes. So the total > log space needed is 360416 (180208 * 2). > Above log space (360416 bytes) needs to be reserved for both run time inode removing > (xfs_inactive_truncate()) and EFI recover (xfs_efi_item_recover()). > > RUNTIME > ======== > > For run time inode removing. The first transaction is mainly used for inode fields change. > The second transaction is used for intents including extent freeing. > > For the first transaction, it has 180208 reserved log bytes (another 180208 bytes reserved > for the coming transaction). > The first transaction is committed, writing some bytes to log and releasing the left reserved bytes. > > Now we have the second transaction which has 180208 log bytes reserved too. The second > transaction is supposed to process intents including extent freeing. With my hacking patch, > I blocked the extent freeing 5 hours. So in that 5 hours, 180208 (NOT 360416) log bytes are reserved. > > With my test case, other transactions (update timestamps) then happen. 
As my hacking patch > pins the journal tail, those timestamp-updating transactions finally use up (almost) all the left available > log space (in memory in on disk). And finally the on disk (and in memory) available log space > goes down near to 180208 bytes. Those 180208 bytes are reserved by above second (extent-free) > transaction. > > Panic the kernel and remount the xfs volume > > LOG RECOVER > ============= > > With log recover, during EFI recover, we use tr_itruncate again to reserve two transactions that needs > 360416 log bytes. Reserving 360416 bytes fails (blocks) because we now only have about 180208 available. > > THINKING > ======== > Actually during the EFI recover, we only need one transaction to free the extents just like the 2nd > transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have > (a bit) more than 180208 available log bytes on disk, so the reservation goes and the recovery goes. > That is to say: we can fix the log recover part to fix the issue. We can introduce a new xfs_trans_res > xfs_mount->tr_ext_free > { > tr_logres = 175488, > tr_logcount = 0, > tr_logflags = 0, > } > and use tr_ext_free instead of tr_itruncate in EFI recover. (didn’t try it). > The following patch recovers the problematic XFS volume with my hacked kernel and also with the one from the customer. commit 19fad903e213717a92f8b94fe2c0c68b6a6ee7f7 (HEAD -> 35587163_fix) Author: Wengang Wang <wen.gang.wang@oracle.com> Date: Mon Aug 21 15:03:58 2023 -0700 xfs: reserve less log space when recovering EFIs Currently tr_itruncate is used for both run time truncating and boot time EFI recovery. tr_itruncate { tr_logres = 175488, tr_logcount = 2, tr_logflags = XFS_TRANS_PERM_LOG_RES, } is a permanent two-transaction series. Actually only the second transaction is really used to free extents, and that needs half of the log space reservation from tr_itruncate. 
For EFI recovery, the only thing to do is to free extents, so it doesn't need the full log space reservation of tr_itruncate. It needs half of it and shouldn't need more than that. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index f1a5ecf099aa..428984e48d23 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -667,6 +667,7 @@ xfs_efi_item_recover( int i; int error = 0; bool requeue_only = false; + struct xfs_trans_res tres; /* * First check the validity of the extents described by the @@ -683,7 +684,10 @@ xfs_efi_item_recover( } } - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); + tres.tr_logres = M_RES(mp)->tr_itruncate.tr_logres; + tres.tr_logcount = 0; + tres.tr_logflags = 0; + error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp); if (error) return error; efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents); thanks, wengang > thanks, > wengang > >> On Jul 28, 2023, at 10:56 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: >> >> >> >>> On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote: >>> >>> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: >>>>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: >>>>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: >>>>>> FYI: >>>>>> >>>>>> I am able reproduce the XFS mount hang issue with hacked kernels based on >>>>>> both 4.14.35 kernel or 6.4.0 kernel. >>>>>> Reproduce steps: >>>>>> >>>>>> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following >>>>>> steps all aim at this XFS volume. >>>>> >>>>> Actually, make that a few milliseconds.... :) >>>> >>>> :) >>>> >>>>> mkfs/xfs_info output would be appreciated. 
>>>> >>>> sure, >>>> # xfs_info 20GB.bk2 >>>> meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks >>>> = sectsz=512 attr=2, projid32bit=1 >>>> = crc=0 finobt=0, sparse=0, rmapbt=0 >>>> = reflink=0 >>> >>> Hmmmm. Why are you only testing v4 filesystems? They are deprecated >>> and support is largely due to be dropped from upstream in 2025... >>> >> >> Ha, it just happened to be so. >> I was trying to reproduce it in the same environment as customer has — >> that’s OracleLinux7. The default behavior of mkfs.xfs in OL7 is to format >> v4 filesystems. I created the xfs image in a file on OL7 and copied the image >> file to a 6.4.0 kernel machine. That’s why you see v4 filesystem here. >> >>> Does the same problem occur with a v5 filesystems? >> >> Will test and report back. >> >>> >>>>>> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 >>>>>> kernel and 6.4.0 kernel. >>>>> >>>>> Which is is clearly wrong. It should be at least 360416 bytes (i.e >>>>> tr_itrunc), because that's what the EFI being processed that pins >>>>> the tail of the log is supposed to have reserved when it was >>>>> stalled. >>>> >>>> Yep, exactly. >>>> >>>>> So where has the ~180kB of leaked space come from? >>>>> >>>>> Have you traced the grant head reservations to find out >>>>> what the runtime log space and grant head reservations actually are? >>>> I have the numbers in vmcore (ignore the WARNs), >>> >>> That's not what I'm asking. You've dumped the values at the time of >>> the hang, not traced the runtime reservations that have been made. >>> >>>>> i.e. we have full tracing of the log reservation accounting via >>>>> tracepoints in the kernel. If there is a leak occurring, you need to >>>>> capture a trace of all the reservation accounting operations and >>>>> post process the output to find out what operation is leaking >>>>> reserved space. e.g. >>>>> >>>>> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo >>>>> .... 
>>>>> # trace-cmd report > s.t >>>>> # head -3 s.t >>>>> cpus=16 >>>>> touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>>>> touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>>>> >>>>> # >>>>> >>>>> So this tells us the transaction reservation unit size, the count of >>>>> reservations, the current reserve and grant head locations, and the >>>>> current head and tail of the log at the time the transaction >>>>> reservation is started and then after it completes. >>>> >>>> Will do that and report back. You want full log or only some typical >>>> ones? Full log would be big, how shall I share? >>> >>> I don't want to see the log. It'll be huge - I regularly generate >>> traces containing gigabytes of log accounting traces like this from >>> a single workload. >>> >>> What I'm asking you to do is run the tracing and then post process >>> the values from the trace to determine what operation is using more >>> space than is being freed back to the log. >>> >>> I generally do this with grep, awk and sed. some people use python >>> or perl. But either way it's a *lot* of work - in the past I have >>> spent _weeks_ on trace analysis to find a 4 byte leak in the log >>> space accounting. DOing things like graphing the head, tail and grant >>> spaces over time tend to show if this is a gradual leak versus a >>> sudden step change. If it's a sudden step change, then you can >>> isolate it in the trace and work out what happened. 
If it's a >>> gradual change, then you need to start looking for accounting >>> discrepancies... >>> >>> e.g. a transaction records 32 bytes used in the item, so it releases >>> t_unit - 32 bytes at commit. However, the CIL may then only track 28 >>> bytes of space for the item in the journal and we leak 4 bytes of >>> reservation on every on of those items committed. >>> >>> These sorts of leaks typically only add up to being somethign >>> significant in situations where the log is flooded with tiny inode >>> timestamp changes - 4 bytes iper item doesn't really matter when you >>> only have a few thousand items in the log, but when you have >>> hundreds of thousands of tiny items in the log... >>> >> >> OK. will work more on this. >> # I am going to start a two-week vacation, and will then continue on this when back. >> >> thanks, >> wengang > > ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-08-21 22:06 ` Wengang Wang @ 2023-08-24 5:05 ` Darrick J. Wong 2023-08-24 22:55 ` Wengang Wang 0 siblings, 1 reply; 22+ messages in thread From: Darrick J. Wong @ 2023-08-24 5:05 UTC (permalink / raw) To: Wengang Wang; +Cc: Dave Chinner, linux-xfs@vger.kernel.org, Srikanth C S On Mon, Aug 21, 2023 at 10:06:45PM +0000, Wengang Wang wrote: > > > > On Aug 17, 2023, at 8:25 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > > > > What things looks to be: > > > > For the file deletion. log bytes are reserved basing on xfs_mount->tr_itruncate which is: > > { > > tr_logres = 175488, > > tr_logcount = 2, > > tr_logflags = XFS_TRANS_PERM_LOG_RES, > > } > > You see it’s permanent log reservation with two log operations (two transactions in rolling mode). > > After calculation (xlog_calc_unit_res(), adding space for various log headers), the final > > log space needed per transaction changes from 175488 to 180208 bytes. So the total > > log space needed is 360416 (180208 * 2). > > Above log space (360416 bytes) needs to be reserved for both run time inode removing > > (xfs_inactive_truncate()) and EFI recover (xfs_efi_item_recover()). > > > > RUNTIME > > ======== > > > > For run time inode removing. The first transaction is mainly used for inode fields change. > > The second transaction is used for intents including extent freeing. > > > > For the first transaction, it has 180208 reserved log bytes (another 180208 bytes reserved > > for the coming transaction). > > The first transaction is committed, writing some bytes to log and releasing the left reserved bytes. > > > > Now we have the second transaction which has 180208 log bytes reserved too. The second > > transaction is supposed to process intents including extent freeing. With my hacking patch, > > I blocked the extent freeing 5 hours. So in that 5 hours, 180208 (NOT 360416) log bytes are reserved. 
> > > > With my test case, other transactions (update timestamps) then happen. As my hacking patch > > pins the journal tail, those timestamp-updating transactions finally use up (almost) all the left available > > log space (in memory in on disk). And finally the on disk (and in memory) available log space > > goes down near to 180208 bytes. Those 180208 bytes are reserved by above second (extent-free) > > transaction. > > > > Panic the kernel and remount the xfs volume > > > > LOG RECOVER > > ============= > > > > With log recover, during EFI recover, we use tr_itruncate again to reserve two transactions that needs > > 360416 log bytes. Reserving 360416 bytes fails (blocks) because we now only have about 180208 available. > > > > THINKING > > ======== > > Actually during the EFI recover, we only need one transaction to free the extents just like the 2nd > > transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have > > (a bit) more than 180208 available log bytes on disk, so the reservation goes and the recovery goes. > > That is to say: we can fix the log recover part to fix the issue. We can introduce a new xfs_trans_res > > xfs_mount->tr_ext_free > > { > > tr_logres = 175488, > > tr_logcount = 0, > > tr_logflags = 0, > > } > > and use tr_ext_free instead of tr_itruncate in EFI recover. (didn’t try it). > > > > The following patch recovers the problematic XFS volume by my hacked kernel and the also > the one from customer. > > commit 19fad903e213717a92f8b94fe2c0c68b6a6ee7f7 (HEAD -> 35587163_fix) > Author: Wengang Wang <wen.gang.wang@oracle.com> > Date: Mon Aug 21 15:03:58 2023 -0700 > > xfs: reserve less log space when recovering EFIs > > Currently tr_itruncate is used for both run time truncating and > boot time EFI recovery. tr_itruncate > { > tr_logres = 175488, > tr_logcount = 2, > tr_logflags = XFS_TRANS_PERM_LOG_RES, > } > > Is a permanent two-transaction series. 
Actually only the second transaction > is really used to free extents and that needs half of the log space reservation > from tr_itruncate. > > For EFI recovery, the things to do is just to free extents, so it doesn't > needs full log space reservation by tr_itruncate. It needs half of it and > shouldn't need more than that. > > Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> > > diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c > index f1a5ecf099aa..428984e48d23 100644 > --- a/fs/xfs/xfs_extfree_item.c > +++ b/fs/xfs/xfs_extfree_item.c u> @@ -667,6 +667,7 @@ xfs_efi_item_recover( > int i; > int error = 0; > bool requeue_only = false; > + struct xfs_trans_res tres; > > /* > * First check the validity of the extents described by the > @@ -683,7 +684,10 @@ xfs_efi_item_recover( > } > } > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); > + tres.tr_logres = M_RES(mp)->tr_itruncate.tr_logres; > + tres.tr_logcount = 0; HAH that was fast. I'm glad it worked. > + tres.tr_logflags = 0; Though I think we should preserve XFS_TRANS_PERM_LOG_RES since other parts of the codebase check that the flag was conveyed from the transaction reservation into tp->t_flags itself. I think for an upstream patch we'd rather fix all of them, and in a systemic way. How about adding this to xfs_log_recover.h: /* * Transform a regular reservation into one suitable for recovery of a * log intent item. * * Intent recovery only runs a single step of the transaction chain and * defers the rest to a separate transaction. Therefore, we reduce * logcount to 1 here to avoid livelocks if the log grant space is * nearly exhausted due to the recovered intent pinning the tail. Keep * the same logflags to avoid tripping asserts elsewhere. Struct copies * abound below. 
*/ static inline struct xfs_trans_res xlog_recover_resv(const struct xfs_trans_res *template) { struct xfs_trans_res ret = { .tr_logres = template->tr_logres, .tr_logcount = 1, .tr_logflags = template->tr_logflags, }; return ret; } and then this becomes: struct xfs_trans_res resv; resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate); error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); which should simplify fixing this for the other recovery functions? --D > + error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp); > if (error) > return error; > efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents); > > thanks, > wengang > > > thanks, > > wengang > > > >> On Jul 28, 2023, at 10:56 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > >> > >> > >> > >>> On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote: > >>> > >>> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: > >>>>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: > >>>>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: > >>>>>> FYI: > >>>>>> > >>>>>> I am able reproduce the XFS mount hang issue with hacked kernels based on > >>>>>> both 4.14.35 kernel or 6.4.0 kernel. > >>>>>> Reproduce steps: > >>>>>> > >>>>>> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following > >>>>>> steps all aim at this XFS volume. > >>>>> > >>>>> Actually, make that a few milliseconds.... :) > >>>> > >>>> :) > >>>> > >>>>> mkfs/xfs_info output would be appreciated. > >>>> > >>>> sure, > >>>> # xfs_info 20GB.bk2 > >>>> meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks > >>>> = sectsz=512 attr=2, projid32bit=1 > >>>> = crc=0 finobt=0, sparse=0, rmapbt=0 > >>>> = reflink=0 > >>> > >>> Hmmmm. Why are you only testing v4 filesystems? They are deprecated > >>> and support is largely due to be dropped from upstream in 2025... > >>> > >> > >> Ha, it just happened to be so. 
> >> I was trying to reproduce it in the same environment as the customer has — > >> that’s OracleLinux7. The default behavior of mkfs.xfs in OL7 is to format > >> v4 filesystems. I created the xfs image in a file on OL7 and copied the image > >> file to a 6.4.0 kernel machine. That’s why you see v4 filesystem here. > >> > >>> Does the same problem occur with a v5 filesystem? > >> > >> Will test and report back. > >> > >>> > >>>>>> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 > >>>>>> kernel and 6.4.0 kernel. > >>>>> > >>>>> Which is clearly wrong. It should be at least 360416 bytes (i.e > >>>>> tr_itrunc), because that's what the EFI being processed that pins > >>>>> the tail of the log is supposed to have reserved when it was > >>>>> stalled. > >>>> > >>>> Yep, exactly. > >>>> > >>>>> So where has the ~180kB of leaked space come from? > >>>>> > >>>>> Have you traced the grant head reservations to find out > >>>>> what the runtime log space and grant head reservations actually are? > >>>> I have the numbers in vmcore (ignore the WARNs), > >>> > >>> That's not what I'm asking. You've dumped the values at the time of > >>> the hang, not traced the runtime reservations that have been made. > >>> > >>>>> i.e. we have full tracing of the log reservation accounting via > >>>>> tracepoints in the kernel. If there is a leak occurring, you need to > >>>>> capture a trace of all the reservation accounting operations and > >>>>> post process the output to find out what operation is leaking > >>>>> reserved space. e.g.
> >>>>> # trace-cmd report > s.t > >>>>> # head -3 s.t > >>>>> cpus=16 > >>>>> touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > >>>>> touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > >>>>> > >>>>> # > >>>>> > >>>>> So this tells us the transaction reservation unit size, the count of > >>>>> reservations, the current reserve and grant head locations, and the > >>>>> current head and tail of the log at the time the transaction > >>>>> reservation is started and then after it completes. > >>>> > >>>> Will do that and report back. You want full log or only some typical > >>>> ones? Full log would be big, how shall I share? > >>> > >>> I don't want to see the log. It'll be huge - I regularly generate > >>> traces containing gigabytes of log accounting traces like this from > >>> a single workload. > >>> > >>> What I'm asking you to do is run the tracing and then post process > >>> the values from the trace to determine what operation is using more > >>> space than is being freed back to the log. > >>> > >>> I generally do this with grep, awk and sed. some people use python > >>> or perl. But either way it's a *lot* of work - in the past I have > >>> spent _weeks_ on trace analysis to find a 4 byte leak in the log > >>> space accounting. DOing things like graphing the head, tail and grant > >>> spaces over time tend to show if this is a gradual leak versus a > >>> sudden step change. 
If it's a sudden step change, then you can > >>> isolate it in the trace and work out what happened. If it's a > >>> gradual change, then you need to start looking for accounting > >>> discrepancies... > >>> > >>> e.g. a transaction records 32 bytes used in the item, so it releases > >>> t_unit - 32 bytes at commit. However, the CIL may then only track 28 > >>> bytes of space for the item in the journal and we leak 4 bytes of > >>> reservation on every one of those items committed. > >>> > >>> These sorts of leaks typically only add up to being something > >>> significant in situations where the log is flooded with tiny inode > >>> timestamp changes - 4 bytes per item doesn't really matter when you > >>> only have a few thousand items in the log, but when you have > >>> hundreds of thousands of tiny items in the log... > >>> > >> > >> OK. will work more on this. > >> # I am going to start a two-week vacation, and will then continue on this when back. > >> > >> thanks, > >> wengang
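Darrick's proposed xlog_recover_resv() helper is small enough to model outside the kernel. The sketch below is a hypothetical userspace mirror of it: the struct layout and the flag value are stand-ins for the real XFS headers, not the actual kernel definitions. It illustrates the intended behavior, namely that tr_logres and tr_logflags pass through unchanged while tr_logcount is forced to 1, so intent recovery only reserves grant space for the single step it actually runs.

```c
#include <assert.h>

/* Stand-ins for the kernel types and flag (hypothetical values). */
struct xfs_trans_res {
	unsigned int	tr_logres;	/* log space unit per transaction */
	int		tr_logcount;	/* number of log operations */
	int		tr_logflags;	/* XFS_TRANS_PERM_LOG_RES or 0 */
};
#define XFS_TRANS_PERM_LOG_RES	0x1

/*
 * Mirror of the helper proposed for xfs_log_recover.h: keep logres and
 * logflags from the template, but drop the reservation to a single log
 * count, since intent recovery performs only one step of the chain and
 * defers the rest to a separate transaction.
 */
static inline struct xfs_trans_res
xlog_recover_resv(const struct xfs_trans_res *template)
{
	struct xfs_trans_res ret = {
		.tr_logres	= template->tr_logres,
		.tr_logcount	= 1,
		.tr_logflags	= template->tr_logflags,
	};
	return ret;
}
```

With tr_itruncate's values from the thread (tr_logres = 175488, tr_logcount = 2, permanent reservation), the copy keeps the 175488-byte unit and the permanent flag but asks for one unit of grant space instead of two.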
* Re: Question: reserve log space at IO time for recover 2023-08-24 5:05 ` Darrick J. Wong @ 2023-08-24 22:55 ` Wengang Wang 0 siblings, 0 replies; 22+ messages in thread From: Wengang Wang @ 2023-08-24 22:55 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs@vger.kernel.org, Srikanth C S > On Aug 23, 2023, at 10:05 PM, Darrick J. Wong <djwong@kernel.org> wrote: > > On Mon, Aug 21, 2023 at 10:06:45PM +0000, Wengang Wang wrote: >> >> >>> On Aug 17, 2023, at 8:25 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote: >>> >>> What things looks to be: >>> >>> For the file deletion. log bytes are reserved basing on xfs_mount->tr_itruncate which is: >>> { >>> tr_logres = 175488, >>> tr_logcount = 2, >>> tr_logflags = XFS_TRANS_PERM_LOG_RES, >>> } >>> You see it’s permanent log reservation with two log operations (two transactions in rolling mode). >>> After calculation (xlog_calc_unit_res(), adding space for various log headers), the final >>> log space needed per transaction changes from 175488 to 180208 bytes. So the total >>> log space needed is 360416 (180208 * 2). >>> Above log space (360416 bytes) needs to be reserved for both run time inode removing >>> (xfs_inactive_truncate()) and EFI recover (xfs_efi_item_recover()). >>> >>> RUNTIME >>> ======== >>> >>> For run time inode removing. The first transaction is mainly used for inode fields change. >>> The second transaction is used for intents including extent freeing. >>> >>> For the first transaction, it has 180208 reserved log bytes (another 180208 bytes reserved >>> for the coming transaction). >>> The first transaction is committed, writing some bytes to log and releasing the left reserved bytes. >>> >>> Now we have the second transaction which has 180208 log bytes reserved too. The second >>> transaction is supposed to process intents including extent freeing. With my hacking patch, >>> I blocked the extent freeing 5 hours. So in that 5 hours, 180208 (NOT 360416) log bytes are reserved. 
>>> >>> With my test case, other transactions (update timestamps) then happen. As my hacking patch >>> pins the journal tail, those timestamp-updating transactions finally use up (almost) all the left available >>> log space (in memory in on disk). And finally the on disk (and in memory) available log space >>> goes down near to 180208 bytes. Those 180208 bytes are reserved by above second (extent-free) >>> transaction. >>> >>> Panic the kernel and remount the xfs volume >>> >>> LOG RECOVER >>> ============= >>> >>> With log recover, during EFI recover, we use tr_itruncate again to reserve two transactions that needs >>> 360416 log bytes. Reserving 360416 bytes fails (blocks) because we now only have about 180208 available. >>> >>> THINKING >>> ======== >>> Actually during the EFI recover, we only need one transaction to free the extents just like the 2nd >>> transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have >>> (a bit) more than 180208 available log bytes on disk, so the reservation goes and the recovery goes. >>> That is to say: we can fix the log recover part to fix the issue. We can introduce a new xfs_trans_res >>> xfs_mount->tr_ext_free >>> { >>> tr_logres = 175488, >>> tr_logcount = 0, >>> tr_logflags = 0, >>> } >>> and use tr_ext_free instead of tr_itruncate in EFI recover. (didn’t try it). >>> >> >> The following patch recovers the problematic XFS volume by my hacked kernel and the also >> the one from customer. >> >> commit 19fad903e213717a92f8b94fe2c0c68b6a6ee7f7 (HEAD -> 35587163_fix) >> Author: Wengang Wang <wen.gang.wang@oracle.com> >> Date: Mon Aug 21 15:03:58 2023 -0700 >> >> xfs: reserve less log space when recovering EFIs >> >> Currently tr_itruncate is used for both run time truncating and >> boot time EFI recovery. tr_itruncate >> { >> tr_logres = 175488, >> tr_logcount = 2, >> tr_logflags = XFS_TRANS_PERM_LOG_RES, >> } >> >> Is a permanent two-transaction series. 
Actually only the second transaction >> is really used to free extents and that needs half of the log space reservation >> from tr_itruncate. >> >> For EFI recovery, the only thing to do is to free extents, so it doesn't >> need the full log space reservation of tr_itruncate. It needs half of it and >> shouldn't need more than that. >> >> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> >> >> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c >> index f1a5ecf099aa..428984e48d23 100644 >> --- a/fs/xfs/xfs_extfree_item.c >> +++ b/fs/xfs/xfs_extfree_item.c >> @@ -667,6 +667,7 @@ xfs_efi_item_recover( >> int i; >> int error = 0; >> bool requeue_only = false; >> + struct xfs_trans_res tres; >> >> /* >> * First check the validity of the extents described by the >> @@ -683,7 +684,10 @@ xfs_efi_item_recover( >> } >> } >> >> - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); >> + tres.tr_logres = M_RES(mp)->tr_itruncate.tr_logres; >> + tres.tr_logcount = 0; > > HAH that was fast. I'm glad it worked. > >> + tres.tr_logflags = 0; > > Though I think we should preserve XFS_TRANS_PERM_LOG_RES since other > parts of the codebase check that the flag was conveyed from the > transaction reservation into tp->t_flags itself. > Sure, I could do that. I am just wondering about the meaning of XFS_TRANS_PERM_LOG_RES. I am thinking it means that the to-be-reserved log bytes would be used for multiple transactions in a chain. And the tr_logcount indicates the number of the transactions in the chain. So my understanding is that a single transaction is not a chain (surely, you can understand it as a one-transaction chain). So XFS_TRANS_PERM_LOG_RES is not needed. Is that a correct understanding? > I think for an upstream patch we'd rather fix all of them, and in a > systemic way. How about adding this to xfs_log_recover.h: > > /* > * Transform a regular reservation into one suitable for recovery of a > * log intent item.
> * > * Intent recovery only runs a single step of the transaction chain and > * defers the rest to a separate transaction. Therefore, we reduce > * logcount to 1 here to avoid livelocks if the log grant space is > * nearly exhausted due to the recovered intent pinning the tail. Keep > * the same logflags to avoid tripping asserts elsewhere. Struct copies > * abound below. > */ > static inline struct xfs_trans_res > xlog_recover_resv(const struct xfs_trans_res *template) > { > struct xfs_trans_res ret = { > .tr_logres = template->tr_logres, > .tr_logcount = 1, > .tr_logflags = template->tr_logflags, > }; > > return ret; > } > > and then this becomes: > > struct xfs_trans_res resv; > > resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate); > error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); > > which should simplify fixing this for the other recovery functions? > Yes, I could do that. thanks, wengang > --D > >> + error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp); > >> if (error) >> return error; >> efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents); >> >> thanks, >> wengang >> >>> thanks, >>> wengang >>> >>>> On Jul 28, 2023, at 10:56 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: >>>> >>>> >>>> >>>>> On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote: >>>>> >>>>> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: >>>>>>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: >>>>>>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: >>>>>>>> FYI: >>>>>>>> >>>>>>>> I am able reproduce the XFS mount hang issue with hacked kernels based on >>>>>>>> both 4.14.35 kernel or 6.4.0 kernel. >>>>>>>> Reproduce steps: >>>>>>>> >>>>>>>> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following >>>>>>>> steps all aim at this XFS volume. >>>>>>> >>>>>>> Actually, make that a few milliseconds.... :) >>>>>> >>>>>> :) >>>>>> >>>>>>> mkfs/xfs_info output would be appreciated. 
>>>>>> >>>>>> sure, >>>>>> # xfs_info 20GB.bk2 >>>>>> meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks >>>>>> = sectsz=512 attr=2, projid32bit=1 >>>>>> = crc=0 finobt=0, sparse=0, rmapbt=0 >>>>>> = reflink=0 >>>>> >>>>> Hmmmm. Why are you only testing v4 filesystems? They are deprecated >>>>> and support is largely due to be dropped from upstream in 2025... >>>>> >>>> >>>> Ha, it just happened to be so. >>>> I was trying to reproduce it in the same environment as customer has — >>>> that’s OracleLinux7. The default behavior of mkfs.xfs in OL7 is to format >>>> v4 filesystems. I created the xfs image in a file on OL7 and copied the image >>>> file to a 6.4.0 kernel machine. That’s why you see v4 filesystem here. >>>> >>>>> Does the same problem occur with a v5 filesystems? >>>> >>>> Will test and report back. >>>> >>>>> >>>>>>>> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 >>>>>>>> kernel and 6.4.0 kernel. >>>>>>> >>>>>>> Which is is clearly wrong. It should be at least 360416 bytes (i.e >>>>>>> tr_itrunc), because that's what the EFI being processed that pins >>>>>>> the tail of the log is supposed to have reserved when it was >>>>>>> stalled. >>>>>> >>>>>> Yep, exactly. >>>>>> >>>>>>> So where has the ~180kB of leaked space come from? >>>>>>> >>>>>>> Have you traced the grant head reservations to find out >>>>>>> what the runtime log space and grant head reservations actually are? >>>>>> I have the numbers in vmcore (ignore the WARNs), >>>>> >>>>> That's not what I'm asking. You've dumped the values at the time of >>>>> the hang, not traced the runtime reservations that have been made. >>>>> >>>>>>> i.e. we have full tracing of the log reservation accounting via >>>>>>> tracepoints in the kernel. If there is a leak occurring, you need to >>>>>>> capture a trace of all the reservation accounting operations and >>>>>>> post process the output to find out what operation is leaking >>>>>>> reserved space. e.g. 
>>>>>>> >>>>>>> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo >>>>>>> .... >>>>>>> # trace-cmd report > s.t >>>>>>> # head -3 s.t >>>>>>> cpus=16 >>>>>>> touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>>>>>> touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>>>>>> >>>>>>> # >>>>>>> >>>>>>> So this tells us the transaction reservation unit size, the count of >>>>>>> reservations, the current reserve and grant head locations, and the >>>>>>> current head and tail of the log at the time the transaction >>>>>>> reservation is started and then after it completes. >>>>>> >>>>>> Will do that and report back. You want full log or only some typical >>>>>> ones? Full log would be big, how shall I share? >>>>> >>>>> I don't want to see the log. It'll be huge - I regularly generate >>>>> traces containing gigabytes of log accounting traces like this from >>>>> a single workload. >>>>> >>>>> What I'm asking you to do is run the tracing and then post process >>>>> the values from the trace to determine what operation is using more >>>>> space than is being freed back to the log. >>>>> >>>>> I generally do this with grep, awk and sed. some people use python >>>>> or perl. But either way it's a *lot* of work - in the past I have >>>>> spent _weeks_ on trace analysis to find a 4 byte leak in the log >>>>> space accounting. 
Doing things like graphing the head, tail and grant >>>>> spaces over time tends to show if this is a gradual leak versus a >>>>> sudden step change. If it's a sudden step change, then you can >>>>> isolate it in the trace and work out what happened. If it's a >>>>> gradual change, then you need to start looking for accounting >>>>> discrepancies... >>>>> >>>>> e.g. a transaction records 32 bytes used in the item, so it releases >>>>> t_unit - 32 bytes at commit. However, the CIL may then only track 28 >>>>> bytes of space for the item in the journal and we leak 4 bytes of >>>>> reservation on every one of those items committed. >>>>> >>>>> These sorts of leaks typically only add up to being something >>>>> significant in situations where the log is flooded with tiny inode >>>>> timestamp changes - 4 bytes per item doesn't really matter when you >>>>> only have a few thousand items in the log, but when you have >>>>> hundreds of thousands of tiny items in the log... >>>>> >>>> >>>> OK. will work more on this. >>>> # I am going to start a two-week vacation, and will then continue on this when back. >>>> >>>> thanks, >>>> wengang
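To make the reservation arithmetic in this message concrete, here is a minimal userspace sketch. It is an illustrative model, not kernel code; the only inputs are the numbers quoted in the thread (a 180208-byte reservation unit after xlog_calc_unit_res(), and 181760 bytes of free on-disk log space at mount time). It shows why reserving a two-count tr_itruncate hangs recovery while a one-count reservation would fit.

```c
#include <assert.h>
#include <stdbool.h>

/* Numbers reported earlier in the thread. */
#define LOG_UNIT_RES	180208u	/* tr_itruncate unit after xlog_calc_unit_res() */
#define LOG_FREE_BYTES	181760u	/* free on-disk log space at mount time */

/* Grant space asked for up front: one unit per log count. */
static unsigned int grant_needed(unsigned int unit_res, int logcount)
{
	return unit_res * (unsigned int)logcount;
}

/* The reservation can only be granted if it fits in the free space. */
static bool reservation_fits(unsigned int unit_res, int logcount)
{
	return grant_needed(unit_res, logcount) <= LOG_FREE_BYTES;
}
```

With logcount = 2 the request is 360416 bytes, which exceeds the 181760 bytes available, so the mount blocks in xlog_grant_head_wait() forever; with logcount = 1 the 180208-byte request fits and recovery can proceed.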
* Re: Question: reserve log space at IO time for recover 2023-08-18 3:25 ` Wengang Wang 2023-08-21 22:06 ` Wengang Wang @ 2023-08-24 4:52 ` Darrick J. Wong 2023-08-24 7:28 ` Dave Chinner 1 sibling, 1 reply; 22+ messages in thread From: Darrick J. Wong @ 2023-08-24 4:52 UTC (permalink / raw) To: Wengang Wang; +Cc: Dave Chinner, linux-xfs@vger.kernel.org, Srikanth C S On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: > What things look to be: > > For the file deletion, log bytes are reserved based on xfs_mount->tr_itruncate which is: > { > tr_logres = 175488, > tr_logcount = 2, > tr_logflags = XFS_TRANS_PERM_LOG_RES, > } > You see it’s a permanent log reservation with two log operations (two transactions in rolling mode). > After calculation (xlog_calc_unit_res(), adding space for various log headers), the final > log space needed per transaction changes from 175488 to 180208 bytes. So the total > log space needed is 360416 (180208 * 2). > The above log space (360416 bytes) needs to be reserved for both run time inode removal > (xfs_inactive_truncate()) and EFI recovery (xfs_efi_item_recover()). > > RUNTIME > ======== > > For run time inode removal, the first transaction is mainly used for inode field changes. > The second transaction is used for intents including extent freeing. > > For the first transaction, it has 180208 reserved log bytes (another 180208 bytes reserved > for the coming transaction). > The first transaction is committed, writing some bytes to the log and releasing the remaining reserved bytes. > > Now we have the second transaction which has 180208 log bytes reserved too. The second > transaction is supposed to process intents including extent freeing. With my hacking patch, > I blocked the extent freeing for 5 hours. So in those 5 hours, 180208 (NOT 360416) log bytes are reserved. > > With my test case, other transactions (update timestamps) then happen.
As my hacking patch > pins the journal tail, those timestamp-updating transactions finally use up (almost) all the remaining available > log space (in memory and on disk). And finally the on disk (and in memory) available log space > goes down near to 180208 bytes. Those 180208 bytes are reserved by the above second (extent-free) > transaction. > > Panic the kernel and remount the xfs volume > > LOG RECOVER > ============= > > With log recovery, during EFI recovery, we use tr_itruncate again to reserve two transactions that need > 360416 log bytes. Reserving 360416 bytes fails (blocks) because we now only have about 180208 available. > > THINKING > ======== > Actually during the EFI recovery, we only need one transaction to free the extents, just like the 2nd > transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have > (a bit) more than 180208 available log bytes on disk, so the reservation goes through and the recovery proceeds. > That is to say: we can fix the log recovery part to fix the issue. We can introduce a new xfs_trans_res > xfs_mount->tr_ext_free > { > tr_logres = 175488, > tr_logcount = 0, > tr_logflags = 0, > } > and use tr_ext_free instead of tr_itruncate in EFI recovery. (didn’t try it). Hmm. The log intent item recovery functions (e.g. xfs_efi_item_recover) take the recovered log item and perform *exactly* one step of the transaction chain. If there's more work to do, the xfs_defer_ops_capture_and_commit machinery captures the defer ops, block reservation, and transaction reservation for later. Later, in xlog_finish_defer_ops, we "rehydrate" the old transaction and block reservation, attach the defer ops, and finish the chain: /* * Create a new transaction reservation from the * captured information. Set logcount to 1 to force the * new transaction to regrant every roll so that we can * make forward progress in recovery no matter how full * the log might be.
*/ resv.tr_logres = dfc->dfc_logres; resv.tr_logcount = 1; resv.tr_logflags = XFS_TRANS_PERM_LOG_RES; error = xfs_trans_alloc(mp, &resv, dfc->dfc_blkres, dfc->dfc_rtxres, XFS_TRANS_RESERVE, &tp); if (error) { xlog_force_shutdown(mp->m_log, SHUTDOWN_LOG_IO_ERROR); return error; } /* * Transfer to this new transaction all the dfops we * captured from recovering a single intent item. */ list_del_init(&dfc->dfc_list); xfs_defer_ops_continue(dfc, tp, &dres); error = xfs_trans_commit(tp); Since xfs_efi_item_recover is only performing one step of what could be a chain of deferred updates, it never rolls the transaction that it creates. It therefore only requires the amount of grant space that you'd get with tr_logcount == 1. It is therefore a bit silly that we ask for more than that, and in bad cases like this, hang log recovery needlessly. Which is exactly what you theorized above. Ok, I'm starting to be convinced... :) I wonder, if you add this to the variable declarations in xfs_efi_item_recover (or xfs_efi_recover if we're actually talking about UEK5 here): struct xfs_trans_resv resv = M_RES(mp)->tr_itruncate; and then change the xfs_trans_alloc call to: resv.tr_logcount = 1; error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); Does that solve the problem? --D > thanks, > wengang > > > On Jul 28, 2023, at 10:56 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > > > > > > > >> On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote: > >> > >> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: > >>>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: > >>>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: > >>>>> FYI: > >>>>> > >>>>> I am able reproduce the XFS mount hang issue with hacked kernels based on > >>>>> both 4.14.35 kernel or 6.4.0 kernel. > >>>>> Reproduce steps: > >>>>> > >>>>> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following > >>>>> steps all aim at this XFS volume. 
> >>>> > >>>> Actually, make that a few milliseconds.... :) > >>> > >>> :) > >>> > >>>> mkfs/xfs_info output would be appreciated. > >>> > >>> sure, > >>> # xfs_info 20GB.bk2 > >>> meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks > >>> = sectsz=512 attr=2, projid32bit=1 > >>> = crc=0 finobt=0, sparse=0, rmapbt=0 > >>> = reflink=0 > >> > >> Hmmmm. Why are you only testing v4 filesystems? They are deprecated > >> and support is largely due to be dropped from upstream in 2025... > >> > > > > Ha, it just happened to be so. > > I was trying to reproduce it in the same environment as customer has — > > that’s OracleLinux7. The default behavior of mkfs.xfs in OL7 is to format > > v4 filesystems. I created the xfs image in a file on OL7 and copied the image > > file to a 6.4.0 kernel machine. That’s why you see v4 filesystem here. > > > >> Does the same problem occur with a v5 filesystems? > > > > Will test and report back. > > > >> > >>>>> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 > >>>>> kernel and 6.4.0 kernel. > >>>> > >>>> Which is is clearly wrong. It should be at least 360416 bytes (i.e > >>>> tr_itrunc), because that's what the EFI being processed that pins > >>>> the tail of the log is supposed to have reserved when it was > >>>> stalled. > >>> > >>> Yep, exactly. > >>> > >>>> So where has the ~180kB of leaked space come from? > >>>> > >>>> Have you traced the grant head reservations to find out > >>>> what the runtime log space and grant head reservations actually are? > >>> I have the numbers in vmcore (ignore the WARNs), > >> > >> That's not what I'm asking. You've dumped the values at the time of > >> the hang, not traced the runtime reservations that have been made. > >> > >>>> i.e. we have full tracing of the log reservation accounting via > >>>> tracepoints in the kernel. 
If there is a leak occurring, you need to > >>>> capture a trace of all the reservation accounting operations and > >>>> post process the output to find out what operation is leaking > >>>> reserved space. e.g. > >>>> > >>>> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo > >>>> .... > >>>> # trace-cmd report > s.t > >>>> # head -3 s.t > >>>> cpus=16 > >>>> touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > >>>> touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > >>>> > >>>> # > >>>> > >>>> So this tells us the transaction reservation unit size, the count of > >>>> reservations, the current reserve and grant head locations, and the > >>>> current head and tail of the log at the time the transaction > >>>> reservation is started and then after it completes. > >>> > >>> Will do that and report back. You want full log or only some typical > >>> ones? Full log would be big, how shall I share? > >> > >> I don't want to see the log. It'll be huge - I regularly generate > >> traces containing gigabytes of log accounting traces like this from > >> a single workload. > >> > >> What I'm asking you to do is run the tracing and then post process > >> the values from the trace to determine what operation is using more > >> space than is being freed back to the log. > >> > >> I generally do this with grep, awk and sed. some people use python > >> or perl. 
But either way it's a *lot* of work - in the past I have > >> spent _weeks_ on trace analysis to find a 4 byte leak in the log > >> space accounting. Doing things like graphing the head, tail and grant > >> spaces over time tends to show if this is a gradual leak versus a > >> sudden step change. If it's a sudden step change, then you can > >> isolate it in the trace and work out what happened. If it's a > >> gradual change, then you need to start looking for accounting > >> discrepancies... > >> > >> e.g. a transaction records 32 bytes used in the item, so it releases > >> t_unit - 32 bytes at commit. However, the CIL may then only track 28 > >> bytes of space for the item in the journal and we leak 4 bytes of > >> reservation on every one of those items committed. > >> > >> These sorts of leaks typically only add up to being something > >> significant in situations where the log is flooded with tiny inode > >> timestamp changes - 4 bytes per item doesn't really matter when you > >> only have a few thousand items in the log, but when you have > >> hundreds of thousands of tiny items in the log... > >> > > > > OK. will work more on this. > > # I am going to start a two-week vacation, and will then continue on this when back. > > > > thanks, > > wengang
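Dave's 4-byte leak example quoted above can be sketched as a toy model. This is illustrative only, not the real CIL accounting code: a committed transaction gives back t_unit minus the bytes it recorded for the item, but if the CIL tracks fewer bytes for that item than the commit accounted for, the difference is leaked reservation that compounds per committed item.

```c
#include <assert.h>

/*
 * Toy model of the accounting discrepancy Dave describes: the commit
 * path accounts used_bytes for an item, the CIL tracks only cil_bytes,
 * so each committed item leaks (used_bytes - cil_bytes) of reservation.
 */
static unsigned long leaked_bytes(unsigned long used_bytes,
				  unsigned long cil_bytes,
				  unsigned long nr_items)
{
	return (used_bytes - cil_bytes) * nr_items;
}
```

At 4 bytes per item, a few thousand items leak only a few KiB, but hundreds of thousands of tiny timestamp-update items (say 300,000) would leak over a megabyte, which is substantial against the 10 MiB log used in the reproducer.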
* Re: Question: reserve log space at IO time for recover 2023-08-24 4:52 ` Darrick J. Wong @ 2023-08-24 7:28 ` Dave Chinner 2023-08-24 22:01 ` Darrick J. Wong 2023-08-24 23:53 ` Wengang Wang 0 siblings, 2 replies; 22+ messages in thread From: Dave Chinner @ 2023-08-24 7:28 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote: > On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: > > Since xfs_efi_item_recover is only performing one step of what could be > a chain of deferred updates, it never rolls the transaction that it > creates. It therefore only requires the amount of grant space that > you'd get with tr_logcount == 1. It is therefore a bit silly that we > ask for more than that, and in bad cases like this, hang log recovery > needlessly. But this doesn't fix whatever problem led to the recovery not having the same full tr_itruncate reservation available as was held by the transaction that logged the EFI and was running the extent free at the time the system crashed. There should -always- be enough transaction reservation space in the journal to reserve space for an intent replay if the intent recovery reservation uses the same reservation type as runtime. Hence I think this is still just a band-aid over whatever went wrong at runtime that led to the log not having enough space for a reservation that was clearly held at runtime and hadn't yet been used. > Which is exactly what you theorized above. Ok, I'm starting to be convinced... :) I'm not. :/ > I wonder, if you add this to the variable declarations in > xfs_efi_item_recover (or xfs_efi_recover if we're actually talking about > UEK5 here): > > struct xfs_trans_res resv = M_RES(mp)->tr_itruncate; > > and then change the xfs_trans_alloc call to: > > resv.tr_logcount = 1; > error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); > > Does that solve the problem?
It might fix this specific case given the state of the log at the time the system crashed. However, it doesn't help the general case where whatever went wrong at runtime results in there being less than a single tr_itruncate reservation unit available in the log. One of the recent RH customer cases I looked at had much less than a single tr_itruncate unit reservation when it came to recovering the EFI, so I'm not just making this up. I really don't think this problem is solvable at recovery time. If the runtime hang/crash leaves the EFI pinning the tail of the log and something has allowed the log to be overfull (i.e. log space used plus granted EFI reservation > physical log space), then we may not have any space at all for any sort of reservation during recovery. I suspect that the cause of this recovery issue is that we require an overcommit of the log space accounting before we throttle incoming transaction reservations (i.e. the new reservation has to overrun before we throttle). I think that the reservation accounting overrun detection can race to the first item being placed on the wait list, which might allow transactions that should be throttled to escape, resulting in a temporary log space overcommit. If we crash at just the wrong time and an intent-in-progress pins the tail of the log, the overcommit situation can lead to recovery not having enough space for intent recovery reservations... This subtle overcommit is one of the issues I have been trying to correct with the byte-based grant head accounting patches (which I'm -finally- getting back to). I know that replaying the log from the metadump that reproduces the hang with the byte-based grant head accounting patchset allowed log recovery to succeed. It has a different concept of where the head and tail are and hence how much log space is actually available at any given time, and that difference was just enough to allow a tr_itruncate reservation to succeed.
It also has different reservation grant overrun detection logic, but I'm not 100% sure if it solves this underlying runtime overcommit problem or not yet... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-08-24 7:28 ` Dave Chinner @ 2023-08-24 22:01 ` Darrick J. Wong 2023-08-26 3:37 ` Dave Chinner 2023-08-24 23:53 ` Wengang Wang 1 sibling, 1 reply; 22+ messages in thread From: Darrick J. Wong @ 2023-08-24 22:01 UTC (permalink / raw) To: Dave Chinner; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Thu, Aug 24, 2023 at 05:28:23PM +1000, Dave Chinner wrote: > On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote: > > On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: > > > > Since xfs_efi_item_recover is only performing one step of what could be > > a chain of deferred updates, it never rolls the transaction that it > > creates. It therefore only requires the amount of grant space that > > you'd get with tr_logcount == 1. It is therefore a bit silly that we > > ask for more than that, and in bad cases like this, hang log recovery > > needlessly. > > But this doesn't fix the whatever problem lead to the recovery not > having the same full tr_itruncate reservation available as was held > by the transaction that logged the EFI and was running the extent > free at the time the system crashed. There should -always- be enough > transaction reservation space in the journal to reserve space for an > intent replay if the intent recovery reservation uses the same > reservation type as runtime. > > Hence I think this is still just a band-aid over whatever went wrong > at runtime that lead to the log not having enough space for a > reservation that was clearly held at runtime and hadn't yet used. Maybe I'm not remembering accurately how permanent log reservations work. Let's continue picking on tr_itruncate from Wengang's example. 
IIRC, he said that tr_itruncate on the running system was defined roughly like so:

tr_itruncate = {
	.tr_logres = 180K,
	.tr_logcount = 2,
	.tr_logflags = XFS_TRANS_PERM_LOG_RES,
}

At runtime, when we want to start a truncation update, we do this:

	xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, ...);

Call sequence: xfs_trans_alloc -> xfs_trans_reserve -> xfs_log_reserve -> xlog_ticket_alloc. Ticket allocation assigns tic->t_unit_res = (tr_logres + overhead) and tic->t_cnt = tr_logcount. (Let's pretend for the sake of argument that the overhead is 5K.) Now xfs_log_reserve calls xlog_grant_head_check. That does:

	*need_bytes = xlog_ticket_reservation(log, head, tic);

For the reserve head, the ticket reservation computation is (tic->t_unit_res * tic->t_cnt), which in this case is (185K * 2) == 370K, right? So we make sure there's at least 370K free in the reserve head, then add that to the reserve and write heads. Now that we've allocated the transaction, delete the bmap mapping, log an EFI to free the space, and roll the transaction as part of finishing the deferops chain. Rolling creates a new xfs_trans which shares its ticket with the old transaction. Next, xfs_trans_roll calls __xfs_trans_commit with regrant == true, which calls xlog_cil_commit with the same regrant parameter. xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and subtracts t_curr_res from the reservation and write heads. If the filesystem is fresh and the first transaction only used (say) 20K, then t_curr_res will be 165K, and we give that much reservation back to the reservation head. Or if the file is really fragmented and the first transaction actually uses 170K, then t_curr_res will be 15K, and that's what we give back to the reservation. Having done that, we're now headed into the second transaction with an EFI and 185K of reservation, correct? Now let's say the first transaction gets written to disk and we crash without ever completing the second transaction. 
Now we remount the fs, log recovery finds the unfinished EFI, and calls xfs_efi_recover to finish the EFI. However, xfs_efi_recover starts a new tr_itruncate transaction, which asks for (185K * 2) == 370K of log reservation. This is a lot more than the 185K that we had reserved at the time of the crash. Did I get all that right? If so, what if prior to the crash, the system had many other threads initiating much smaller transactions, say for mtime updates? And what if those threads consumed all the rest of the log reservation and pushed those transactions out to disk after the first transaction in the tr_itruncate chain? Won't log recovery then find the log to contain (in order):

1. The first itruncate transaction and the EFI.
2. A bunch of mtime update transactions.
3. Approximately 185K of free space.

We can't move the log tail forward because we haven't dealt with the EFI; there's only 185K of log space left to reserve; and xfs_efi_recover wants 370K. It's only going to take one step in the intent processing chain and commit that step, so I don't see why the recovery transaction needs more than one step's worth of log space. Any subsequent steps in the deferops chain get a new transaction later once we've unpinned the log tail. > > Which is exactly what you theorized above. Ok, I'm starting to be > > convinced... :) > > I'm not. :/ > > > I wonder, if you add this to the variable declarations in > > xfs_efi_item_recover (or xfs_efi_recover if we're actually talking about > > UEK5 here): > > > > struct xfs_trans_resv resv = M_RES(mp)->tr_itruncate; > > > > and then change the xfs_trans_alloc call to: > > > > resv.tr_logcount = 1; > > error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); > > > > Does that solve the problem? > > It might fix this specific case given the state of the log at the > time the system crashed. Yes. There are running kernels out there that persist this situation (bugs and all) in the ondisk log. 
I know so, because the issue that Wengang has been dealing with came from a customer. IIRC, the customer report stated that they were truncating some heavily fragmented sparse files, the node stalled long enough that the node manager rebooted the node, and when the node came back up, log recovery stalled because that's what the running kernel had written to the log. IOWs, I'm operating on an assumption that we have two problems to solve: the runtime accounting bug that you've been chasing, and Wengang's, where log recovery asks for more log space than what had been in the log ticket at the time of the crash. > However, it doesn't help the general > case where whatever went wrong at runtime results in there being > less than a single tr_itruncate reservation unit available in the > log. I agree that this fix won't address the problem of how runtime got here in the first place. > One of the recent RH custom cases I looked at had much less than a > single tr_itruncate unit reservation when it came to recovering the > EFI, so I'm not just making this up. I didn't think you were, I merely am not ready to decide that these aren't separate problems. :) > I really don't think this problem is not solvable at recovery time. > if the runtime hang/crash leaves the EFI pinning the tail of the log > and something has allowed the log to be overfull (i.e. log space > used plus granted EFI reservation > physical log space), then we may > not have any space at all for any sort of reservation during > recovery. > > I suspect that the cause of this recovery issue is the we require an > overcommit of the log space accounting before we throttle incoming > transaction reservations (i.e. the new reservation has to overrun > before we throttle). 
I think that the reservation accounting overrun > detection can race to the first item being placed on the wait list, Yeah, I was wondering that myself when I was looking at the logic between list_empty_careful and the second xlog_grant_head_wait and wondering if that "careful" construction actually worked. > which might allow transactions that should be throttled to escape > resulting in a temporary log space overcommit. IF we crash at jsut > the wrong time and an intent-in-progress pins the tail of the log, > the overcommit situation can lead to recovery not having enough > space for intent recovery reservations... > > This subtle overcommit is one of the issues I have been trying to > correct with the byte-based grant head accounting patches (which I'm > -finally- getting back to). I know that replaying the log from the > metadump that repoduces the hang with the byte-based grant head > accounting patchset allowed log recovery to succeed. It has a > different concept of where the head and tail are and hence how much > log space is actually available at any given time, and that > difference was just enough to allow a tr_itruncate reservation to > succced. It also has different reservation grant overrun detection > logic, but I'm not 100% sure if it solves this underlying runtime > overcommit problem or not yet... <nod> --D > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-08-24 22:01 ` Darrick J. Wong @ 2023-08-26 3:37 ` Dave Chinner 2023-08-27 16:04 ` Darrick J. Wong 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2023-08-26 3:37 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Thu, Aug 24, 2023 at 03:01:54PM -0700, Darrick J. Wong wrote: > On Thu, Aug 24, 2023 at 05:28:23PM +1000, Dave Chinner wrote: > > On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote: > > > On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: > > > > > > Since xfs_efi_item_recover is only performing one step of what could be > > > a chain of deferred updates, it never rolls the transaction that it > > > creates. It therefore only requires the amount of grant space that > > > you'd get with tr_logcount == 1. It is therefore a bit silly that we > > > ask for more than that, and in bad cases like this, hang log recovery > > > needlessly. > > > > But this doesn't fix the whatever problem lead to the recovery not > > having the same full tr_itruncate reservation available as was held > > by the transaction that logged the EFI and was running the extent > > free at the time the system crashed. There should -always- be enough > > transaction reservation space in the journal to reserve space for an > > intent replay if the intent recovery reservation uses the same > > reservation type as runtime. > > > > Hence I think this is still just a band-aid over whatever went wrong > > at runtime that lead to the log not having enough space for a > > reservation that was clearly held at runtime and hadn't yet used. > > Maybe I'm not remembering accurately how permanent log reservations > work. Let's continue picking on tr_itruncate from Wengang's example. 
> IIRC, he said that tr_itruncate on the running system was defined > roughly like so: > > tr_itruncate = { > .tr_logres = 180K > .tr_logcount = 2, > .tr_logflags = XFS_TRANS_PERM_LOG_RES, > } > > At runtime, when we want to start a truncation update, we do this: > > xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, ...); > > Call sequence: xfs_trans_alloc -> xfs_trans_reserve -> xfs_log_reserve > -> xlog_ticket_alloc. Ticket allocation assigns tic->t_unit_res = > (tr_logres + overhead) and tic->t_cnt = tr_logcount. > > (Let's pretend for the sake of argument that the overhead is 5K.) > > Now xfs_log_reserve calls xlog_grant_head_check. That does: > > *need_bytes = xlog_ticket_reservation(log, head, tic); > > For the reserve head, the ticket reservation computation is > (tic->t_unit_res * tic->t_cnt), which in this case is (185K * 2) == > 370K, right? So we make sure there's at least 370K free in the reserve > head, then add that to the reserve and write heads. > > Now that we've allocated the transaction, delete the bmap mapping, > log an EFI to free the space, and roll the transaction as part of > finishing the deferops chain. Rolling creates a new xfs_trans which > shares its ticket with the old transaction. Next, xfs_trans_roll calls > __xfs_trans_commit with regrant == true, which calls xlog_cil_commit > with the same regrant parameter. > > xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and > subtracts t_curr_res from the reservation and write heads. > > If the filesystem is fresh and the first transaction only used (say) > 20K, then t_curr_res will be 165K, and we give that much reservation > back to the reservation head. Or if the file is really fragmented and > the first transaction actually uses 170K, then t_curr_res will be 15K, > and that's what we give back to the reservation. > > Having done that, we're now headed into the second transaction with an > EFI and 185K of reservation, correct? 
Ah, right, I overlooked that the long running truncates only regrant a single reservation unit at a time when they roll, and the runtime instances I've seen of this are with long running truncate operations (i.e. inactivation after unlink for multi-thousand extent inodes). So, yes, all types of intent recovery must only use a single unit reservation, not just EFIs, because there is no guarantee that there is a full unit * count reservation available in the journal whenever that intent was first logged. Indeed, at best it will be 'count - 1' that is available, because the transaction that logged the intent will have used a full unit of the original reservation.... > IOWs, I'm operating on an assumption that we have two problems to solve: > the runtime acconting bug that you've been chasing, and Wengang's where > log recovery asks for more log space than what had been in the log > ticket at the time of the crash. OK, yes, it does seem there are two problems here... > > I suspect that the cause of this recovery issue is the we require an > > overcommit of the log space accounting before we throttle incoming > > transaction reservations (i.e. the new reservation has to overrun > > before we throttle). I think that the reservation accounting overrun > > detection can race to the first item being placed on the wait list, > > Yeah, I was wondering that myself when I was looking at the logic > between list_empty_careful and the second xlog_grant_head_wait and > wondering if that "careful" construction actually worked. Well, it's not the careful check that is the problem; the race is much simpler than that. We just have to have two concurrent reservations that will both fit in the remaining log space individually but not together. i.e. 
this is the overcommit race window:

P1                              P2
---------------------------     --------------------------
xfs_log_reserve(10000 bytes)
xlog_grant_head_check()
xlog_space_left()
  <sees 12000 bytes>
OK, doesn't wait
                                xfs_log_reserve(10000 bytes)
                                xlog_grant_head_check()
                                xlog_space_left()
                                  <sees 12000 bytes>
                                OK, doesn't wait
xlog_grant_add_space(10k)
                                xlog_grant_add_space(10k)

And now we've overcommitted the log by 8000 bytes.... The current byte based grant head patch set does not address this; the attempt I made to fix it didn't work properly, so I split the rework of the throttling out of the byte based accounting patchset (which seems to work correctly without reworking xlog_grant_head_check()) and now I'm trying to address this race condition as a separate patchset... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-08-26 3:37 ` Dave Chinner @ 2023-08-27 16:04 ` Darrick J. Wong 0 siblings, 0 replies; 22+ messages in thread From: Darrick J. Wong @ 2023-08-27 16:04 UTC (permalink / raw) To: Dave Chinner; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Sat, Aug 26, 2023 at 01:37:24PM +1000, Dave Chinner wrote: > On Thu, Aug 24, 2023 at 03:01:54PM -0700, Darrick J. Wong wrote: > > On Thu, Aug 24, 2023 at 05:28:23PM +1000, Dave Chinner wrote: > > > On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote: > > > > On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: > > > > > > > > Since xfs_efi_item_recover is only performing one step of what could be > > > > a chain of deferred updates, it never rolls the transaction that it > > > > creates. It therefore only requires the amount of grant space that > > > > you'd get with tr_logcount == 1. It is therefore a bit silly that we > > > > ask for more than that, and in bad cases like this, hang log recovery > > > > needlessly. > > > > > > But this doesn't fix the whatever problem lead to the recovery not > > > having the same full tr_itruncate reservation available as was held > > > by the transaction that logged the EFI and was running the extent > > > free at the time the system crashed. There should -always- be enough > > > transaction reservation space in the journal to reserve space for an > > > intent replay if the intent recovery reservation uses the same > > > reservation type as runtime. > > > > > > Hence I think this is still just a band-aid over whatever went wrong > > > at runtime that lead to the log not having enough space for a > > > reservation that was clearly held at runtime and hadn't yet used. > > > > Maybe I'm not remembering accurately how permanent log reservations > > work. Let's continue picking on tr_itruncate from Wengang's example. 
> > IIRC, he said that tr_itruncate on the running system was defined > > roughly like so: > > > > tr_itruncate = { > > .tr_logres = 180K > > .tr_logcount = 2, > > .tr_logflags = XFS_TRANS_PERM_LOG_RES, > > } > > > > At runtime, when we want to start a truncation update, we do this: > > > > xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, ...); > > > > Call sequence: xfs_trans_alloc -> xfs_trans_reserve -> xfs_log_reserve > > -> xlog_ticket_alloc. Ticket allocation assigns tic->t_unit_res = > > (tr_logres + overhead) and tic->t_cnt = tr_logcount. > > > > (Let's pretend for the sake of argument that the overhead is 5K.) > > > > Now xfs_log_reserve calls xlog_grant_head_check. That does: > > > > *need_bytes = xlog_ticket_reservation(log, head, tic); > > > > For the reserve head, the ticket reservation computation is > > (tic->t_unit_res * tic->t_cnt), which in this case is (185K * 2) == > > 370K, right? So we make sure there's at least 370K free in the reserve > > head, then add that to the reserve and write heads. > > > > Now that we've allocated the transaction, delete the bmap mapping, > > log an EFI to free the space, and roll the transaction as part of > > finishing the deferops chain. Rolling creates a new xfs_trans which > > shares its ticket with the old transaction. Next, xfs_trans_roll calls > > __xfs_trans_commit with regrant == true, which calls xlog_cil_commit > > with the same regrant parameter. > > > > xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and > > subtracts t_curr_res from the reservation and write heads. > > > > If the filesystem is fresh and the first transaction only used (say) > > 20K, then t_curr_res will be 165K, and we give that much reservation > > back to the reservation head. Or if the file is really fragmented and > > the first transaction actually uses 170K, then t_curr_res will be 15K, > > and that's what we give back to the reservation. 
> > > > Having done that, we're now headed into the second transaction with an > > EFI and 185K of reservation, correct? > > Ah, right, I overlooked that the long running truncates only regrant a > single reservation unit at a time when they roll, and the runtime > instances I seen of this are with long running truncate operations > (i.e. inactivation after unlink for multi-thousand extent inodes). > > So, yes, all types of intent recovery must only use a single unit > reservation, not just EFIs, because there is no guarantee that there > is a full unit * count reservation available in the journal whenever > that intent was first logged. Indeed, at best it will be 'count - 1' > that is avaialble, becuase the transaction that logged the intent > will have used a full unit of the original reservation.... <nod> I think something like this will fix the problem, right?

diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 2420865f3007..a5100a11faf9 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -131,4 +131,26 @@ void xlog_check_buf_cancel_table(struct xlog *log);
 #define xlog_check_buf_cancel_table(log) do { } while (0)
 #endif
 
+/*
+ * Transform a regular reservation into one suitable for recovery of a log
+ * intent item.
+ *
+ * Intent recovery only runs a single step of the transaction chain and defers
+ * the rest to a separate transaction. Therefore, we reduce logcount to 1 here
+ * to avoid livelocks if the log grant space is nearly exhausted due to the
+ * recovered intent pinning the tail. Keep the same logflags to avoid tripping
+ * asserts elsewhere. Struct copies abound below.
+ */
+static inline struct xfs_trans_res
+xlog_recover_resv(const struct xfs_trans_res *r)
+{
+	struct xfs_trans_res ret = {
+		.tr_logres	= r->tr_logres,
+		.tr_logcount	= 1,
+		.tr_logflags	= r->tr_logflags,
+	};
+
+	return ret;
+}
+
 #endif /* __XFS_LOG_RECOVER_H__ */
diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
index 2788a6f2edcd..36fe2abb16e6 100644
--- a/fs/xfs/xfs_attr_item.c
+++ b/fs/xfs/xfs_attr_item.c
@@ -547,7 +547,7 @@ xfs_attri_item_recover(
 	struct xfs_inode		*ip;
 	struct xfs_da_args		*args;
 	struct xfs_trans		*tp;
-	struct xfs_trans_res		tres;
+	struct xfs_trans_res		resv;
 	struct xfs_attri_log_format	*attrp;
 	struct xfs_attri_log_nameval	*nv = attrip->attri_nameval;
 	int				error;
@@ -618,8 +618,9 @@ xfs_attri_item_recover(
 		goto out;
 	}
 
-	xfs_init_attr_trans(args, &tres, &total);
-	error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp);
+	xfs_init_attr_trans(args, &resv, &total);
+	resv = xlog_recover_resv(&resv);
+	error = xfs_trans_alloc(mp, &resv, total, 0, XFS_TRANS_RESERVE, &tp);
 	if (error)
 		goto out;
 
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 7551c3ec4ea5..e736a0844c89 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -490,6 +490,7 @@ xfs_bui_item_recover(
 	struct list_head		*capture_list)
 {
 	struct xfs_bmap_intent		fake = { };
+	struct xfs_trans_res		resv;
 	struct xfs_bui_log_item		*buip = BUI_ITEM(lip);
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip = NULL;
@@ -515,7 +516,8 @@ xfs_bui_item_recover(
 		return error;
 
 	/* Allocate transaction and do the work. */
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
+	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+	error = xfs_trans_alloc(mp, &resv,
 			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
 	if (error)
 		goto err_rele;
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index f1a5ecf099aa..3fa8789820ad 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -660,6 +660,7 @@ xfs_efi_item_recover(
 	struct xfs_log_item		*lip,
 	struct list_head		*capture_list)
 {
+	struct xfs_trans_res		resv;
 	struct xfs_efi_log_item		*efip = EFI_ITEM(lip);
 	struct xfs_mount		*mp = lip->li_log->l_mp;
 	struct xfs_efd_log_item		*efdp;
@@ -683,7 +684,8 @@ xfs_efi_item_recover(
 		}
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+	error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp);
 	if (error)
 		return error;
 	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index edd8587658d5..2d4444d61e98 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -477,6 +477,7 @@ xfs_cui_item_recover(
 	struct xfs_log_item		*lip,
 	struct list_head		*capture_list)
 {
+	struct xfs_trans_res		resv;
 	struct xfs_cui_log_item		*cuip = CUI_ITEM(lip);
 	struct xfs_cud_log_item		*cudp;
 	struct xfs_trans		*tp;
@@ -514,8 +515,9 @@ xfs_cui_item_recover(
 	 * doesn't fit. We need to reserve enough blocks to handle a
 	 * full btree split on either end of the refcount range.
 	 */
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
-			mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
+	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+	error = xfs_trans_alloc(mp, &resv, mp->m_refc_maxlevels * 2, 0,
+			XFS_TRANS_RESERVE, &tp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 520c7ebdfed8..0e0e747028da 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -507,6 +507,7 @@ xfs_rui_item_recover(
 	struct xfs_log_item		*lip,
 	struct list_head		*capture_list)
 {
+	struct xfs_trans_res		resv;
 	struct xfs_rui_log_item		*ruip = RUI_ITEM(lip);
 	struct xfs_rud_log_item		*rudp;
 	struct xfs_trans		*tp;
@@ -530,8 +531,9 @@ xfs_rui_item_recover(
 		}
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
-			mp->m_rmap_maxlevels, 0, XFS_TRANS_RESERVE, &tp);
+	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+	error = xfs_trans_alloc(mp, &resv, mp->m_rmap_maxlevels, 0,
+			XFS_TRANS_RESERVE, &tp);
 	if (error)
 		return error;
 	rudp = xfs_trans_get_rud(tp, ruip);

> > IOWs, I'm operating on an assumption that we have two problems to solve: > > the runtime acconting bug that you've been chasing, and Wengang's where > > log recovery asks for more log space than what had been in the log > > ticket at the time of the crash. > > OK, yes, it does seem there are two problems here... > > > > I suspect that the cause of this recovery issue is the we require an > > > overcommit of the log space accounting before we throttle incoming > > > transaction reservations (i.e. the new reservation has to overrun > > > before we throttle). I think that the reservation accounting overrun > > > detection can race to the first item being placed on the wait list, > > > > Yeah, I was wondering that myself when I was looking at the logic > > between list_empty_careful and the second xlog_grant_head_wait and > > wondering if that "careful" construction actually worked. 
> > Well, it's not the careful check that is the problem; the race is > much simpler than that. We just have to have two concurrent > reservations that will both fit in the remaining log space > individually but not together. i.e. this is the overcommit race > window:
>
> P1                              P2
> ---------------------------     --------------------------
> xfs_log_reserve(10000 bytes)
> xlog_grant_head_check()
> xlog_space_left()
>   <sees 12000 bytes>
> OK, doesn't wait
>                                 xfs_log_reserve(10000 bytes)
>                                 xlog_grant_head_check()
>                                 xlog_space_left()
>                                   <sees 12000 bytes>
>                                 OK, doesn't wait
> xlog_grant_add_space(10k)
>                                 xlog_grant_add_space(10k)
>
> And now we've overcommitted the log by 8000 bytes.... > > The current byte based grant head patch set does not address this; > the attempt I made to fix it didn't work properly, so I split the > rework of the throttling out of the byte based accounting patchset > (which seems to work correctly without reworking > xlog_grant_head_check()) and now I'm trying to address this race > condition as a separate patchset... Oh. Heh. Yeah. --D > -Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-08-24 7:28 ` Dave Chinner 2023-08-24 22:01 ` Darrick J. Wong @ 2023-08-24 23:53 ` Wengang Wang 1 sibling, 0 replies; 22+ messages in thread From: Wengang Wang @ 2023-08-24 23:53 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S > On Aug 24, 2023, at 12:28 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote: >> On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: >> >> Since xfs_efi_item_recover is only performing one step of what could be >> a chain of deferred updates, it never rolls the transaction that it >> creates. It therefore only requires the amount of grant space that >> you'd get with tr_logcount == 1. It is therefore a bit silly that we >> ask for more than that, and in bad cases like this, hang log recovery >> needlessly. > > But this doesn't fix the whatever problem lead to the recovery not > having the same full tr_itruncate reservation available as was held > by the transaction that logged the EFI and was running the extent > free at the time the system crashed. There should -always- be enough > transaction reservation space in the journal to reserve space for an > intent replay if the intent recovery reservation uses the same > reservation type as runtime. > > Hence I think this is still just a band-aid over whatever went wrong > at runtime that lead to the log not having enough space for a > reservation that was clearly held at runtime and hadn't yet used. Hi Dave, I previously suspected a runtime problem as well, but I didn’t find anything wrong. Per my tests and analysis, the changes (additions and subtractions) to xlog->l_reserve_head and xlog->l_write_head match the changes of xlog->l_curr_cycle/l_curr_block, and xlog->l_curr_cycle/l_curr_block matches the on-disk log head changes. Actually there is no problem at runtime. What’s wrong is just the recovery part. 
As I said previously, it’s not true that we always have the full tr_itruncate reservation at every point during the process. Let me show the detailed steps:

1) We take the full tr_itruncate reservation for the two-transaction chain (half each), and get the first transaction.

2) Work with the first transaction, adding log items to it. Now we have the full tr_itruncate reservation.

3) In the transaction roll, we get the second transaction. The second transaction also has half of the tr_itruncate reservation. Now the first transaction and the second transaction together have the full tr_itruncate reservation.

4) Still in the transaction roll, we commit the first transaction, including the EFI log item(s) in it. The reservation belonging to the first transaction goes three different ways: 1. some is used to store the log items (and the necessary headers); 2. a small part is stolen by the ticket on the checkpoint context; 3. the rest is released after the log items are committed to the CIL: xlog_cil_commit() -> xfs_log_ticket_regrant() -> xlog_grant_sub_space(). Now we have only half of the tr_itruncate reservation (for the second trans in the chain), NOT the full one!

5) Work with the second transaction (it’s slow somehow). Now we have half of the tr_itruncate reservation.

6) Parallel transactions come and go, using up the other free log bytes. Now we still have half of the tr_itruncate reservation, and the free space in the on-disk log is only a bit more than half of the tr_itruncate reservation, but less than the full one.

7) Still working with the second transaction in the chain, freeing the extents, the system crashed. We still hold half of the tr_itruncate reservation.

So you see, since step 5) only half of the tr_itruncate reservation exists. The idea that from step 1) to step 7) there should be a full tr_itruncate reservation is wrong. You can’t say that after the first transaction is committed, the reservation for that transaction is still there, right? Thanks for your comments below, but they appear to be based on that mistaken idea. 
Thanks, Wengang > >> Which is exactly what you theorized above. Ok, I'm starting to be >> convinced... :) > > I'm not. :/ > >> I wonder, if you add this to the variable declarations in >> xfs_efi_item_recover (or xfs_efi_recover if we're actually talking about >> UEK5 here): >> >> struct xfs_trans_resv resv = M_RES(mp)->tr_itruncate; >> >> and then change the xfs_trans_alloc call to: >> >> resv.tr_logcount = 1; >> error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); >> >> Does that solve the problem? > > It might fix this specific case given the state of the log at the > time the system crashed. However, it doesn't help the general > case where whatever went wrong at runtime results in there being > less than a single tr_itruncate reservation unit available in the > log. > > One of the recent RH custom cases I looked at had much less than a > single tr_itruncate unit reservation when it came to recovering the > EFI, so I'm not just making this up. > > I really don't think this problem is not solvable at recovery time. > if the runtime hang/crash leaves the EFI pinning the tail of the log > and something has allowed the log to be overfull (i.e. log space > used plus granted EFI reservation > physical log space), then we may > not have any space at all for any sort of reservation during > recovery. > > I suspect that the cause of this recovery issue is the we require an > overcommit of the log space accounting before we throttle incoming > transaction reservations (i.e. the new reservation has to overrun > before we throttle). I think that the reservation accounting overrun > detection can race to the first item being placed on the wait list, > which might allow transactions that should be throttled to escape > resulting in a temporary log space overcommit. IF we crash at jsut > the wrong time and an intent-in-progress pins the tail of the log, > the overcommit situation can lead to recovery not having enough > space for intent recovery reservations... 
> > This subtle overcommit is one of the issues I have been trying to > correct with the byte-based grant head accounting patches (which I'm > -finally- getting back to). I know that replaying the log from the > metadump that reproduces the hang with the byte-based grant head > accounting patchset allowed log recovery to succeed. It has a > different concept of where the head and tail are and hence how much > log space is actually available at any given time, and that > difference was just enough to allow a tr_itruncate reservation to > succeed. It also has different reservation grant overrun detection > logic, but I'm not 100% sure if it solves this underlying runtime > overcommit problem or not yet... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
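The "overrun before we throttle" behaviour Dave describes above can be sketched with a toy check. This is illustrative only: the function names and numbers are made up, and the real logic lives in the grant head checking code, not here.

```c
#include <assert.h>

/* Toy model: a reservation is throttled only if the log is *already*
 * overrun when it arrives, so the reservation that causes the overrun
 * is itself granted, briefly leaving used > capacity. */
static int toy_try_reserve(long *used, long capacity, long need)
{
    if (*used >= capacity)
        return -1;      /* already overrun: caller sleeps on the wait list */
    *used += need;      /* granted, even if this overshoots capacity */
    return 0;
}

/* with 90 of 100 units used, a request for 30 is still granted
 * (used becomes 120); only the next request is throttled */
static long toy_overrun_demo(void)
{
    long used = 90;
    int first  = toy_try_reserve(&used, 100, 30);   /* granted */
    int second = toy_try_reserve(&used, 100, 10);   /* throttled */

    return (first == 0 && second == -1) ? used : -1;
}
```

If the system crashes in that overcommitted window with an intent pinning the tail of the log, recovery starts with less physical log space than the intent recovery transaction needs, which matches the symptom in this thread.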
* Re: Question: reserve log space at IO time for recover 2023-07-19 0:11 ` Dave Chinner 2023-07-19 1:44 ` Darrick J. Wong @ 2023-07-19 1:46 ` Wengang Wang 1 sibling, 0 replies; 22+ messages in thread From: Wengang Wang @ 2023-07-19 1:46 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org, Srikanth C S > On Jul 18, 2023, at 5:11 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Jul 18, 2023 at 10:57:38PM +0000, Wengang Wang wrote: >> Hi, >> >> I have a XFS metadump (was running with 4.14.35 plussing some back ported patches), >> mounting it (log recover) hang at log space reservation. There is 181760 bytes on-disk >> free journal space, while the transaction needs to reserve 360416 bytes to start the recovery. >> Thus the mount hangs for ever. > > Most likely something went wrong at runtime on the 4.14.35 kernel > prior to the crash, leaving the on-disk state in an impossible to > recover state. Likely an accounting leak in a transaction > reservation somewhere, likely in passing the space used from the > transaction to the CIL. We've had bugs in this area before, they > eventually manifest in log hangs like this either at runtime or > during recovery... > >> That happens with 4.14.35 kernel and also upstream >> kernel (6.4.0). 
> > Upgrading the kernel won't fix recovery - it is likely that the > journal state on disk is invalid and so the mount cannot complete. > >> Here is the related stack dump (6.4.0 kernel): >> >> [<0>] xlog_grant_head_wait+0xbd/0x200 [xfs] >> [<0>] xlog_grant_head_check+0xd9/0x100 [xfs] >> [<0>] xfs_log_reserve+0xbc/0x1e0 [xfs] >> [<0>] xfs_trans_reserve+0x138/0x170 [xfs] >> [<0>] xfs_trans_alloc+0xe8/0x220 [xfs] >> [<0>] xfs_efi_item_recover+0x110/0x250 [xfs] >> [<0>] xlog_recover_process_intents.isra.28+0xba/0x2d0 [xfs] >> [<0>] xlog_recover_finish+0x33/0x310 [xfs] >> [<0>] xfs_log_mount_finish+0xdb/0x160 [xfs] >> [<0>] xfs_mountfs+0x51c/0x900 [xfs] >> [<0>] xfs_fs_fill_super+0x4b8/0x940 [xfs] >> [<0>] get_tree_bdev+0x193/0x280 >> [<0>] vfs_get_tree+0x26/0xd0 >> [<0>] path_mount+0x69d/0x9b0 >> [<0>] do_mount+0x7d/0xa0 >> [<0>] __x64_sys_mount+0xdc/0x100 >> [<0>] do_syscall_64+0x3b/0x90 >> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 >> >> Thus we can say 4.14.35 kernel didn’t reserve log space at IO time to make log recover >> safe. Upstream kernel doesn’t do that either if I read the source code right (I might be wrong). > > Sure they do. > > Log space usage is what the grant heads track; transactions are not > allowed to start if there isn't both reserve and write grant head > space available for them, and transaction rolls get held until there > is write grant space available for them (i.e. they can block in > xfs_trans_roll() -> xfs_trans_reserve() waiting for write grant head > space). > > There have been bugs in the grant head accounting mechanisms in the > past, there may well still be bugs in them. But it is the grant head > mechanism that is supposed to guarantee there is always space in > the journal for a transaction to commit, and by extension, ensure > that we always have space in the journal for a transaction to be > fully recovered. 
> >> So shall we reserve proper amount of log space at IO time, call it Unflush-Reserve, to >> ensure log recovery safe? The number of UR is determined by current un flushed log items. >> It gets increased just after transaction is committed and gets decreased when log items are >> flushed. With the UR, we are safe to have enough log space for the transactions used by log >> recovery. > > The grant heads already track log space usage and reservations like > this. If you want to learn more about the nitty gritty details, look > at this patch set that is aimed at changing how the grant heads > track the used/reserved log space to improve performance: > > https://lore.kernel.org/linux-xfs/20221220232308.3482960-1-david@fromorbit.com/ Thanks Dave a lot! I will look more into the write grant head and the above patch set. Have a good day, Wengang
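Dave's point that a transaction must have space under both grant heads before it may start can be sketched like this. It is a toy model with made-up names, not the real grant head check, and it plugs in the numbers from the original report.

```c
#include <assert.h>

/* Toy model: a new transaction must fit under both the reserve and
 * the write grant head; otherwise it sleeps until space is freed. */
static int toy_log_reserve(long reserve_free, long write_free, long need)
{
    if (reserve_free < need || write_free < need)
        return -1;  /* the real code would sleep waiting for space */
    return 0;
}

/* the numbers from the report: 181760 bytes of free journal space
 * on disk, 360416 bytes needed by the EFI recovery transaction */
static int toy_recovery_hang(void)
{
    return toy_log_reserve(181760, 181760, 360416);
}
```

During log recovery nothing else is running to free journal space, so the "sleep" case never resolves, which is exactly the mount hang shown in the stack trace earlier in the thread.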
end of thread, other threads: [~2023-08-27 16:05 UTC | newest]

Thread overview: 22+ messages
2023-07-18 22:57 Question: reserve log space at IO time for recover Wengang Wang
2023-07-19 0:11 ` Dave Chinner
2023-07-19 1:44 ` Darrick J. Wong
2023-07-19 6:25 ` Dave Chinner
2023-07-21 19:36 ` Wengang Wang
2023-07-24 0:57 ` Dave Chinner
2023-07-24 18:03 ` Wengang Wang
2023-07-26 4:08 ` Dave Chinner
2023-07-26 15:23 ` Darrick J. Wong
2023-07-27 1:05 ` Dave Chinner
2023-07-28 17:56 ` Wengang Wang
2023-08-18 3:25 ` Wengang Wang
2023-08-21 22:06 ` Wengang Wang
2023-08-24 5:05 ` Darrick J. Wong
2023-08-24 22:55 ` Wengang Wang
2023-08-24 4:52 ` Darrick J. Wong
2023-08-24 7:28 ` Dave Chinner
2023-08-24 22:01 ` Darrick J. Wong
2023-08-26 3:37 ` Dave Chinner
2023-08-27 16:04 ` Darrick J. Wong
2023-08-24 23:53 ` Wengang Wang
2023-07-19 1:46 ` Wengang Wang