* Question: reserve log space at IO time for recover
@ 2023-07-18 22:57 Wengang Wang
  2023-07-19  0:11 ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread

From: Wengang Wang @ 2023-07-18 22:57 UTC (permalink / raw)
To: linux-xfs@vger.kernel.org; +Cc: Srikanth C S

Hi,

I have an XFS metadump (it was running with 4.14.35 plus some back-ported
patches); mounting it (log recovery) hangs at log space reservation. There are
181760 bytes of on-disk free journal space, while the transaction needs to
reserve 360416 bytes to start the recovery. Thus the mount hangs forever. That
happens with the 4.14.35 kernel and also with the upstream kernel (6.4.0).

This is the related stack dump (6.4.0 kernel):

[<0>] xlog_grant_head_wait+0xbd/0x200 [xfs]
[<0>] xlog_grant_head_check+0xd9/0x100 [xfs]
[<0>] xfs_log_reserve+0xbc/0x1e0 [xfs]
[<0>] xfs_trans_reserve+0x138/0x170 [xfs]
[<0>] xfs_trans_alloc+0xe8/0x220 [xfs]
[<0>] xfs_efi_item_recover+0x110/0x250 [xfs]
[<0>] xlog_recover_process_intents.isra.28+0xba/0x2d0 [xfs]
[<0>] xlog_recover_finish+0x33/0x310 [xfs]
[<0>] xfs_log_mount_finish+0xdb/0x160 [xfs]
[<0>] xfs_mountfs+0x51c/0x900 [xfs]
[<0>] xfs_fs_fill_super+0x4b8/0x940 [xfs]
[<0>] get_tree_bdev+0x193/0x280
[<0>] vfs_get_tree+0x26/0xd0
[<0>] path_mount+0x69d/0x9b0
[<0>] do_mount+0x7d/0xa0
[<0>] __x64_sys_mount+0xdc/0x100
[<0>] do_syscall_64+0x3b/0x90
[<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8

Thus we can say the 4.14.35 kernel didn’t reserve log space at IO time to make
log recovery safe. The upstream kernel doesn’t do that either, if I read the
source code right (I might be wrong).

So shall we reserve a proper amount of log space at IO time, call it
Unflush-Reserve (UR), to ensure log recovery is safe? The amount of UR is
determined by the currently unflushed log items. It gets increased just after a
transaction is committed and gets decreased when the log items are flushed. With
the UR, we are safe to have enough log space for the transactions used by log
recovery.
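To make the proposal concrete, here is a minimal user-space sketch of the UR accounting. All names here are hypothetical and the structure is purely a model of the idea as described above, not XFS code:

```c
/*
 * Model of the proposed "Unflush-Reserve" (UR) accounting.
 * Hypothetical names; a sketch of the idea, not XFS code.
 */
struct ur_log {
	long	free_bytes;	/* free journal space */
	long	ur_bytes;	/* held back for not-yet-flushed log items */
};

/* Transaction committed: hold back the space recovery would need for its items. */
static void ur_commit(struct ur_log *log, long item_bytes)
{
	log->ur_bytes += item_bytes;
}

/* Log items flushed to their final on-disk location: release the hold. */
static void ur_flush(struct ur_log *log, long item_bytes)
{
	log->ur_bytes -= item_bytes;
}

/* A new reservation may only consume space that is not held back as UR. */
static int ur_can_reserve(const struct ur_log *log, long need)
{
	return log->free_bytes - log->ur_bytes >= need;
}
```

With such a hold in place, runtime transactions would start waiting before the free journal space could drop below what recovery of the outstanding unflushed items needs.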
thanks, wengang ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-18 22:57 Question: reserve log space at IO time for recover Wengang Wang @ 2023-07-19 0:11 ` Dave Chinner 2023-07-19 1:44 ` Darrick J. Wong 2023-07-19 1:46 ` Wengang Wang 0 siblings, 2 replies; 22+ messages in thread From: Dave Chinner @ 2023-07-19 0:11 UTC (permalink / raw) To: Wengang Wang; +Cc: linux-xfs@vger.kernel.org, Srikanth C S On Tue, Jul 18, 2023 at 10:57:38PM +0000, Wengang Wang wrote: > Hi, > > I have a XFS metadump (was running with 4.14.35 plussing some back ported patches), > mounting it (log recover) hang at log space reservation. There is 181760 bytes on-disk > free journal space, while the transaction needs to reserve 360416 bytes to start the recovery. > Thus the mount hangs for ever. Most likely something went wrong at runtime on the 4.14.35 kernel prior to the crash, leaving the on-disk state in an impossible to recover state. Likely an accounting leak in a transaction reservation somewhere, likely in passing the space used from the transaction to the CIL. We've had bugs in this area before, they eventually manifest in log hangs like this either at runtime or during recovery... > That happens with 4.14.35 kernel and also upstream > kernel (6.4.0). 
Upgrading the kernel won't fix recovery - it is likely that the journal state
on disk is invalid and so the mount cannot complete.

> The is the related stack dumping (6.4.0 kernel):
> 
> [<0>] xlog_grant_head_wait+0xbd/0x200 [xfs]
> [<0>] xlog_grant_head_check+0xd9/0x100 [xfs]
> [<0>] xfs_log_reserve+0xbc/0x1e0 [xfs]
> [<0>] xfs_trans_reserve+0x138/0x170 [xfs]
> [<0>] xfs_trans_alloc+0xe8/0x220 [xfs]
> [<0>] xfs_efi_item_recover+0x110/0x250 [xfs]
> [<0>] xlog_recover_process_intents.isra.28+0xba/0x2d0 [xfs]
> [<0>] xlog_recover_finish+0x33/0x310 [xfs]
> [<0>] xfs_log_mount_finish+0xdb/0x160 [xfs]
> [<0>] xfs_mountfs+0x51c/0x900 [xfs]
> [<0>] xfs_fs_fill_super+0x4b8/0x940 [xfs]
> [<0>] get_tree_bdev+0x193/0x280
> [<0>] vfs_get_tree+0x26/0xd0
> [<0>] path_mount+0x69d/0x9b0
> [<0>] do_mount+0x7d/0xa0
> [<0>] __x64_sys_mount+0xdc/0x100
> [<0>] do_syscall_64+0x3b/0x90
> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> 
> Thus we can say 4.14.35 kernel didn’t reserve log space at IO time to make log recover
> safe. Upstream kernel doesn’t do that either if I read the source code right (I might be wrong).

Sure they do.

Log space usage is what the grant heads track; transactions are not allowed to
start if there isn't both reserve and write grant head space available for
them, and transaction rolls get held until there is write grant space
available for them (i.e. they can block in xfs_trans_roll() ->
xfs_trans_reserve() waiting for write grant head space).

There have been bugs in the grant head accounting mechanisms in the past, and
there may well still be bugs in them. But it is the grant head mechanism that
is supposed to guarantee there is always space in the journal for a
transaction to commit and, by extension, to ensure that we always have space
in the journal for a transaction to be fully recovered.

> So shall we reserve proper amount of log space at IO time, call it Unflush-Reserve, to
> ensure log recovery safe? The number of UR is determined by current un flushed log items.
> It gets increased just after transaction is committed and gets decreased when log items are
> flushed. With the UR, we are safe to have enough log space for the transactions used by log
> recovery.

The grant heads already track log space usage and reservations like this. If
you want to learn more about the nitty-gritty details, look at this patch set
that is aimed at changing how the grant heads track the used/reserved log
space to improve performance:

https://lore.kernel.org/linux-xfs/20221220232308.3482960-1-david@fromorbit.com/

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-19 0:11 ` Dave Chinner @ 2023-07-19 1:44 ` Darrick J. Wong 2023-07-19 6:25 ` Dave Chinner 2023-07-19 1:46 ` Wengang Wang 1 sibling, 1 reply; 22+ messages in thread From: Darrick J. Wong @ 2023-07-19 1:44 UTC (permalink / raw) To: Dave Chinner; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Wed, Jul 19, 2023 at 10:11:03AM +1000, Dave Chinner wrote: > On Tue, Jul 18, 2023 at 10:57:38PM +0000, Wengang Wang wrote: > > Hi, > > > > I have a XFS metadump (was running with 4.14.35 plussing some back ported patches), > > mounting it (log recover) hang at log space reservation. There is 181760 bytes on-disk > > free journal space, while the transaction needs to reserve 360416 bytes to start the recovery. > > Thus the mount hangs for ever. > > Most likely something went wrong at runtime on the 4.14.35 kernel > prior to the crash, leaving the on-disk state in an impossible to > recover state. Likely an accounting leak in a transaction > reservation somewhere, likely in passing the space used from the > transaction to the CIL. We've had bugs in this area before, they > eventually manifest in log hangs like this either at runtime or > during recovery... > > > That happens with 4.14.35 kernel and also upstream > > kernel (6.4.0). > > Upgrading the kernel won't fix recovery - it is likely that the > journal state on disk is invalid and so the mount cannot complete Hmm. It'd be nice to know what the kernel thought it was doing when it went down. /me wonders if this has anything to do with the EFI recovery creating a transaction with tr_itruncate reservation because the log itself doesn't record the reservations of the active transactions. <begin handwaving> Let's say you have a 1000K log, a tr_write reservation is 100k, and a tr_itruncate reservations are 300k. 
In this case, you could theoretically have 10x tr_write transactions running concurrently; or you could have 3x tr_itruncate transactions running concurrently. Now let's say that someone fires up 10 programs that try to fpunch 10 separate files. Those ten threads will consume all the log grant space, unmap a block, and log an EFI. I think in reality tr_logcount means that 5 threads each consume (2*100k) grant space, but the point here is that we've used up all the log grant space. Then crash the system, having committed the first transaction of the two-transaction chain. Upon recovery, we'll find the 10x unfinished EFIs and pass them to EFI recovery. However, recovery creates a separate tr_itruncate transaction to finish each EFI. Now do we have a problem because the required log grant space is 300k * 10 = 3000k? It's late and I don't remember how recovery for non-intent items works quite well enough to think that scenario adds up. Maybe it was the case that before the system went down, the log had used 800K of the grant space for logged buffers and 100K for a single EFI logged in a tr_write transaction. Then we crashed, reloaded the 800K of stuff, and now we're trying to allocate 300K for a tr_itruncate to restart the EFI, but there's not enough log grant space? 
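A quick check of the arithmetic in the handwave above. The 1000K/100k/300k figures are only the example numbers from this message, nothing measured:

```c
/* Example numbers from the handwave above, in KiB. Illustrative only. */
enum {
	LOG_SIZE     = 1000,	/* journal size */
	TR_WRITE     = 100,	/* tr_write reservation */
	TR_ITRUNCATE = 300,	/* tr_itruncate reservation */
};

/* How many transactions of a given unit size can hold grant space at once? */
static int max_concurrent(int unit_res)
{
	return LOG_SIZE / unit_res;
}

/* Grant space EFI recovery would need if all N reservations were held at once. */
static int recovery_demand(int nintents)
{
	return nintents * TR_ITRUNCATE;
}
```

max_concurrent(TR_WRITE) is 10 and max_concurrent(TR_ITRUNCATE) is 3, while recovery_demand(10) is 3000K - three times the log - which is exactly the apparent overcommit being questioned here.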
<muggawuggamuggawugga> --D > > The is the related stack dumping (6.4.0 kernel): > > > > [<0>] xlog_grant_head_wait+0xbd/0x200 [xfs] > > [<0>] xlog_grant_head_check+0xd9/0x100 [xfs] > > [<0>] xfs_log_reserve+0xbc/0x1e0 [xfs] > > [<0>] xfs_trans_reserve+0x138/0x170 [xfs] > > [<0>] xfs_trans_alloc+0xe8/0x220 [xfs] > > [<0>] xfs_efi_item_recover+0x110/0x250 [xfs] > > [<0>] xlog_recover_process_intents.isra.28+0xba/0x2d0 [xfs] > > [<0>] xlog_recover_finish+0x33/0x310 [xfs] > > [<0>] xfs_log_mount_finish+0xdb/0x160 [xfs] > > [<0>] xfs_mountfs+0x51c/0x900 [xfs] > > [<0>] xfs_fs_fill_super+0x4b8/0x940 [xfs] > > [<0>] get_tree_bdev+0x193/0x280 > > [<0>] vfs_get_tree+0x26/0xd0 > > [<0>] path_mount+0x69d/0x9b0 > > [<0>] do_mount+0x7d/0xa0 > > [<0>] __x64_sys_mount+0xdc/0x100 > > [<0>] do_syscall_64+0x3b/0x90 > > [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 > > > > Thus we can say 4.14.35 kernel didn’t reserve log space at IO time to make log recover > > safe. Upstream kernel doesn’t do that either if I read the source code right (I might be wrong). > > Sure they do. > > Log space usage is what the grant heads track; transactions are not > allowed to start if there isn't both reserve and write grant head > space available for them, and transaction rolls get held until there > is write grant space available for them (i.e. they can block in > xfs_trans_roll() -> xfs_trans_reserve() waiting for write grant head > space). > > There have been bugs in the grant head accounting mechanisms in the > past, there may well still be bugs in it. But it is the grant head > mechanisms that is supposed to guarantee there is always space in > the journal for a transaction to commit, and by extension, ensure > that we always have space in the journal for a transaction to be > fully recovered. > > > So shall we reserve proper amount of log space at IO time, call it Unflush-Reserve, to > > ensure log recovery safe? The number of UR is determined by current un flushed log items. 
> > It gets increased just after transaction is committed and gets decreased when log items are > > flushed. With the UR, we are safe to have enough log space for the transactions used by log > > recovery. > > The grant heads already track log space usage and reservations like > this. If you want to learn more about the nitty gritty details, look > at this patch set that is aimed at changing how the grant heads > track the used/reserved log space to improve performance: > > https://lore.kernel.org/linux-xfs/20221220232308.3482960-1-david@fromorbit.com/ > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-19 1:44 ` Darrick J. Wong @ 2023-07-19 6:25 ` Dave Chinner 2023-07-21 19:36 ` Wengang Wang 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2023-07-19 6:25 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Tue, Jul 18, 2023 at 06:44:13PM -0700, Darrick J. Wong wrote: > On Wed, Jul 19, 2023 at 10:11:03AM +1000, Dave Chinner wrote: > > On Tue, Jul 18, 2023 at 10:57:38PM +0000, Wengang Wang wrote: > > > Hi, > > > > > > I have a XFS metadump (was running with 4.14.35 plussing some back ported patches), > > > mounting it (log recover) hang at log space reservation. There is 181760 bytes on-disk > > > free journal space, while the transaction needs to reserve 360416 bytes to start the recovery. > > > Thus the mount hangs for ever. > > > > Most likely something went wrong at runtime on the 4.14.35 kernel > > prior to the crash, leaving the on-disk state in an impossible to > > recover state. Likely an accounting leak in a transaction > > reservation somewhere, likely in passing the space used from the > > transaction to the CIL. We've had bugs in this area before, they > > eventually manifest in log hangs like this either at runtime or > > during recovery... > > > > > That happens with 4.14.35 kernel and also upstream > > > kernel (6.4.0). > > > > Upgrading the kernel won't fix recovery - it is likely that the > > journal state on disk is invalid and so the mount cannot complete > > Hmm. It'd be nice to know what the kernel thought it was doing when it > went down. > > /me wonders if this has anything to do with the EFI recovery creating a > transaction with tr_itruncate reservation because the log itself doesn't > record the reservations of the active transactions. Possibly - it's been that way since 1994 but I don't recall it ever causing any issues in the past. 
That's not to say it's correct - I think it's wrong, but I think the whole transaction reservation calculation infrastructure needs a complete overhaul.... > > <begin handwaving> > > Let's say you have a 1000K log, a tr_write reservation is 100k, and a > tr_itruncate reservations are 300k. In this case, you could > theoretically have 10x tr_write transactions running concurrently; or > you could have 3x tr_itruncate transactions running concurrently. > > Now let's say that someone fires up 10 programs that try to fpunch 10 > separate files. Those ten threads will consume all the log grant space, > unmap a block, and log an EFI. I think in reality tr_logcount means > that 5 threads each consume (2*100k) grant space, but the point here is > that we've used up all the log grant space. > > Then crash the system, having committed the first transaction of the > two-transaction chain. > > Upon recovery, we'll find the 10x unfinished EFIs and pass them to EFI > recovery. However, recovery creates a separate tr_itruncate transaction > to finish each EFI. Now do we have a problem because the required log > grant space is 300k * 10 = 3000k? Hmmmm. That smells wrong. Can't put my finger on it ..... .... ah. Yeah. That. We only run one transaction at a time, and we commit the transaction after logging new intents and capturing the work that remains. So we return the unused part of the reservation (most of it) back to the log before we try to recover the next intent in the AIL. Hence we don't need (300k * 10) in the log to recover these EFIs as we don't hold all ten reservations at the same time (as we would have at runtime) - we need (log space used by recovery of intents + one reservation) to recover them all. Once we've replayed all the intents from the AIL and converted them into newly captured intents, they are removed from the AIL and that moves the tail of the log forwards. 
This frees up the entire log, and we then run the captured intents that still
need to be processed. We run them one at a time to completion, committing them
as we go, so again we only need space in the log for a single transaction
reservation to complete recovery of the intent chains.

IOWs, because recovery of intents is single threaded, we only need to preserve
space in the log for a single reservation to make forwards progress.

> It's late and I don't remember how recovery for non-intent items works
> quite well enough to think that scenario adds up. Maybe it was the case
> that before the system went down, the log had used 800K of the grant
> space for logged buffers and 100K for a single EFI logged in a tr_write
> transaction. Then we crashed, reloaded the 800K of stuff, and now we're
> trying to allocate 300K for a tr_itruncate to restart the EFI, but
> there's not enough log grant space?

Possible, if the EFI pins the tail of the log and the rest of the log is full.
I've never seen that happen, and we've been using itrunc reservations in
recovery since 1994, but that doesn't mean it can't happen.

FWIW, I don't think this is EFI specific. BUI recovery uses itrunc
reservations, but what if it was an unlink operation using a remove
reservation to free a directory block that logs a BUI? Same problem,
different vector, right?

I suspect what we really need is for all the intent processing to be
restartable. We already have that for RUIs, to relog/restart them when there
isn't enough reservation space available to complete processing the current
RUI. We just made EFIs restartable to avoid busy extent deadlocks; we can
easily extend that with the same "enough reservation available" check as we
use for RUIs, etc. Do the same for BUIs and RUIs, and then...

... we can set up a reservation calculation for each intent type, and the
reservation needed for a given chain of operations is the max of all the
steps in the chain. Hence if we get part way through a chain and run out of
reservation, we can restart the chain with the reservation we know is large
enough to complete the remaining part of the chain.

The new reservation may be smaller than the reservation that was held when we
started the intent processing (because it's a rolling chain with an inherited
log ticket), but this guarantees that we can reduce the reservation to the
minimum required at any point in time if we are running low on log space....

This also gets around the problem of having to reserve enough space for N
operations (e.g. 4 extents in an EFI) when the vast majority only use and
need space for 1 operation. If we get an EFI with N extents in it, we can try
a reservation for (N * efi_res), and if we can't get that we could just use
efi_res and work through the EFI one extent at a time....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover
  2023-07-19  6:25     ` Dave Chinner
@ 2023-07-21 19:36       ` Wengang Wang
  2023-07-24  0:57         ` Dave Chinner
  0 siblings, 1 reply; 22+ messages in thread

From: Wengang Wang @ 2023-07-21 19:36 UTC (permalink / raw)
To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S

FYI:

I am able to reproduce the XFS mount hang issue with hacked kernels based on
both the 4.14.35 kernel and the 6.4.0 kernel.

The hacking patch is like this (based on the 6.4.0 kernel):

diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index f9e36b810663..e04e8738c564 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -491,6 +491,12 @@ xfs_extent_free_finish_item(
 	struct xfs_extent_free_item	*xefi;
 	int				error;
 
+	if (strcmp(tp->t_mountp->m_super->s_id,"loop0") == 0) {
+		pr_err("loop0 free extent enter sleeping 5 hours\n");
+		msleep(3600*1000*5);
+		pr_err("loop0 free extent end sleeping 5 hours\n");
+	}
+
 	xefi = container_of(item, struct xfs_extent_free_item, xefi_list);
 
 	error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi);
diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c
index 7d4109af193e..b571fe6966af 100644
--- a/fs/xfs/xfs_trans_ail.c
+++ b/fs/xfs/xfs_trans_ail.c
@@ -557,6 +557,13 @@ xfsaild_push(
 	xfs_trans_ail_cursor_done(&cur);
 	spin_unlock(&ailp->ail_lock);
 
+	if (strcmp(ailp->ail_log->l_mp->m_super->s_id,"loop0") == 0) {
+		pr_err("loop0 aild enter sleeping 5 hours\n");
+		msleep(3600*1000*5); // sleep 5 hours
+		pr_err("loop0 aild end sleeping 5 hours\n");
+		pr_err("loop0 start issuing buffer IO\n");
+	}
+
 	if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list))
 		ailp->ail_log_flush++;

The hacking patch does two things:
1. sleeps 5 hours in the extent freeing deferred op before reserving log
   space. That simulates very slow extent free processing.
2. sleeps 5 hours before flushing metadata buffers. That simulates very slow
   AIL processing.
Though the hacking patch sleeps for 5 hours, it actually needs only 10 minutes
to reproduce this issue.

Reproduce steps:

1. Create an XFS with a 10MiB log size (small, so it is easier to reproduce).
   The following steps all aim at this XFS volume.

2. With the un-hacked kernel, create 100K files, 1 block per file, under a
   dedicated directory. We will overwrite those files, causing inode timestamp
   logging. A timestamp logging transaction (tr_fsyncts) requires/reserves 2508
   bytes, which is less than the 360416 bytes required (tr_itrunc) on the
   4.14.35 kernel. I was not able to get the exact number for 6.4.0, but it
   should be similar. With those timestamp updates, we are able to make the
   free log size less than 360416 bytes.

3. With the hacked kernel, create a new file, allocate block(s) to it, and
   remove it. Because of the hacking patch, an EFI will be logged but the EFD
   will not be for 5 hours. We will panic the kernel within those 5 hours so
   that on the next mount, EFI recovery will happen.

4. On another console, serially overwrite the 100K files we created in step 2.
   This gets stuck after overwriting 57K+ files. That's because we used up
   (almost) all the free log space and we pinned the log tail position with
   the hacking patch. Checking the vmcore, it is stuck waiting for 2508 log
   bytes (1552 free bytes available). The numbers are exactly the same on the
   4.14.35 and 6.4.0 kernels.

5. Checking the on-disk free log space, it’s 181760 bytes for both the 4.14.35
   and 6.4.0 kernels. That is not the same as the free log space seen in the
   vmcore, 1552 bytes. I am not able to figure out why they are different yet.
   Does the delta aim to make log recovery safe?

6. Panic the kernel manually and reboot to the un-hacked kernel. Verify the
   on-disk free log space; it's the same as seen in step 5.

7. With the un-hacked kernel, mount that XFS volume; it hangs with both the
   4.14.35 and 6.4.0 kernels. Checking the vmcore, the free log bytes are now
   the same as the on-disk free log bytes we found in step 6. That's for the
   4.14.35 kernel.
I am not able to check the numbers for the 6.4.0 kernel, but I think they are
the same as seen on the 4.14.35 kernel. So reproduced!!

Checking back on the hacking patch, I am simulating very slow buffer flushing
and extent freeing. I'd think it's possible that the same things can get slow
somehow in real production cases, so the hacking patch is reasonable. When the
free log bytes run out and a kernel panic or manual panic/reboot (like
echo c/b >/proc/sysrq-trigger) happens, it's possible that even the 6.4.0
kernel can run into the same situation as seen on the 4.14.35 kernel. So the
6.4.0 kernel needs a fix too.

I will look more into this.

thanks,
wengang

> On Jul 18, 2023, at 11:25 PM, Dave Chinner <david@fromorbit.com> wrote:
> 
> On Tue, Jul 18, 2023 at 06:44:13PM -0700, Darrick J. Wong wrote:
>> On Wed, Jul 19, 2023 at 10:11:03AM +1000, Dave Chinner wrote:
>>> On Tue, Jul 18, 2023 at 10:57:38PM +0000, Wengang Wang wrote:
>>>> Hi,
>>>> 
>>>> I have a XFS metadump (was running with 4.14.35 plussing some back ported patches),
>>>> mounting it (log recover) hang at log space reservation. There is 181760 bytes on-disk
>>>> free journal space, while the transaction needs to reserve 360416 bytes to start the recovery.
>>>> Thus the mount hangs for ever.
>>> 
>>> Most likely something went wrong at runtime on the 4.14.35 kernel
>>> prior to the crash, leaving the on-disk state in an impossible to
>>> recover state. Likely an accounting leak in a transaction
>>> reservation somewhere, likely in passing the space used from the
>>> transaction to the CIL. We've had bugs in this area before, they
>>> eventually manifest in log hangs like this either at runtime or
>>> during recovery...
>>> 
>>>> That happens with 4.14.35 kernel and also upstream
>>>> kernel (6.4.0).
>>> 
>>> Upgrading the kernel won't fix recovery - it is likely that the
>>> journal state on disk is invalid and so the mount cannot complete
>> 
>> Hmm. It'd be nice to know what the kernel thought it was doing when it
>> went down.
>> >> /me wonders if this has anything to do with the EFI recovery creating a >> transaction with tr_itruncate reservation because the log itself doesn't >> record the reservations of the active transactions. > > Possibly - it's been that way since 1994 but I don't recall it ever > causing any issues in the past. That's not to say it's correct - I > think it's wrong, but I think the whole transaction reservation > calculation infrastructure needs a complete overhaul.... > >> >> <begin handwaving> >> >> Let's say you have a 1000K log, a tr_write reservation is 100k, and a >> tr_itruncate reservations are 300k. In this case, you could >> theoretically have 10x tr_write transactions running concurrently; or >> you could have 3x tr_itruncate transactions running concurrently. >> >> Now let's say that someone fires up 10 programs that try to fpunch 10 >> separate files. Those ten threads will consume all the log grant space, >> unmap a block, and log an EFI. I think in reality tr_logcount means >> that 5 threads each consume (2*100k) grant space, but the point here is >> that we've used up all the log grant space. >> >> Then crash the system, having committed the first transaction of the >> two-transaction chain. >> >> Upon recovery, we'll find the 10x unfinished EFIs and pass them to EFI >> recovery. However, recovery creates a separate tr_itruncate transaction >> to finish each EFI. Now do we have a problem because the required log >> grant space is 300k * 10 = 3000k? > > Hmmmm. That smells wrong. Can't put my finger on it ..... > > > .... ah. Yeah. That. > > We only run one transaction at a time, and we commit the transaction > after logging new intents and capturing the work that remains. So we > return the unused part of the reservation (most of it) back to the > log before we try to recover the next intent in the AIL. 
> > Hence we don't need (300k * 10) in the log to recover these EFIs as > we don't hold all ten reservations at the same time (as we would > have at runtime) - we need (log space used by recovery of intents + > one reservation) to recover them all. > > Once we've replayed all the intents from the AIL and converted them > into newly captured intents, they are removed from the AIL and that > moves the tail of the log forwards. This frees up the entire of the > log, and we then run the captured intents that still need to be > processed. We run them one at a time to completion, committing them > as we go, so again we only need space in the log for a single > transaction reservation to complete recovery of the intent chaings. > > IOWs, because recovery of intents is single threaded, we only need > to preserve space in the log for a single reservation to make > forwards progress. > >> It's late and I don't remember how recovery for non-intent items works >> quite well enough to think that scenario adds up. Maybe it was the case >> that before the system went down, the log had used 800K of the grant >> space for logged buffers and 100K for a single EFI logged in a tr_write >> transaction. Then we crashed, reloaded the 800K of stuff, and now we're >> trying to allocate 300K for a tr_itruncate to restart the EFI, but >> there's not enough log grant space? > > Possible, if the EFI pins the tail of the log and the rest of the > log is full. I've never seen that happen, and we've been using > itrunc reservations in recovery since 1994, but that doesn't mean it > can't happen. > > FWIW, I don't think this is EFI specific. BUI recovery use itrunc > reservations, but what if it was an unlink operation using a remove > reservation to free a directory block that logs a BUI? Same problem, > different vector, right? > > I suspect what we really need is for all the intent processing to be > restartable. 
We already have those for RUIs to relog/restart them > there isn't enough reservation space available to complete > processing the current RUI. We just made EFIs restartable to avoid > busy extent deadlocks, we can easily extend that the same "enough > reservation available" as we use for RUIs, etc. Do the same for BUIs > and RUIs, and then... > > ... we can set up a reservation calculation for each intent type, > and the reservation needed for a given chain of operations is the > max of all the steps in the chain. Hence if we get part way through > a chain and run out of reservation, we can restart the chain the > the reservation we know is large enough to complete the remaining > part of the chain. > > The new reservation may be smaller than the reservation that was > held when we start the intent processing (because it's a rolling > chain with an inherited log ticket), but this guarantees that we can > reduce the reservation to the minimum required at any point in > time if we are running low on log space.... > > This also gets around the problem of having to reserve enough space > for N operations (e.g. 4 extents in an EFI) when the vast majority > only use and need space for 1 operation. If we get an EFI with N > extents in it, we can try a reservation for (N * efi_res) and if we > can't get that we could just use efi_res and work through the EFI > one extent at a time.... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-21 19:36 ` Wengang Wang @ 2023-07-24 0:57 ` Dave Chinner 2023-07-24 18:03 ` Wengang Wang 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2023-07-24 0:57 UTC (permalink / raw) To: Wengang Wang; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: > FYI: > > I am able reproduce the XFS mount hang issue with hacked kernels based on > both 4.14.35 kernel or 6.4.0 kernel. > > The hacking patch is like this (based on 6.4.0 kernel): > > diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c > index f9e36b810663..e04e8738c564 100644 > --- a/fs/xfs/xfs_extfree_item.c > +++ b/fs/xfs/xfs_extfree_item.c > @@ -491,6 +491,12 @@ xfs_extent_free_finish_item( > struct xfs_extent_free_item *xefi; > int error; > > + if (strcmp(tp->t_mountp->m_super->s_id,"loop0") == 0) { > + pr_err("loop0 free extent enter sleeping 5 hours\n"); > + msleep(3600*1000*5); > + pr_err("loop0 free extent end sleeping 5 hours\n"); > + } > xefi = container_of(item, struct xfs_extent_free_item, xefi_list); > > error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi); > diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c > index 7d4109af193e..b571fe6966af 100644 > --- a/fs/xfs/xfs_trans_ail.c > +++ b/fs/xfs/xfs_trans_ail.c > @@ -557,6 +557,13 @@ xfsaild_push( > xfs_trans_ail_cursor_done(&cur); > spin_unlock(&ailp->ail_lock); > > + if (strcmp(ailp->ail_log->l_mp->m_super->s_id,"loop0") == 0) { > + pr_err("loop0 aild enter sleeping 5 hours\n"); > + msleep(3600*1000*5); // sleep 5 hours > + pr_err("loop0 aild end sleeping 5 hours\n"); > + pr_err("loop0 start issuing buffer IO\n"); > + } > + > if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list)) > ailp->ail_log_flush++; > > The hacking patch does two things: > 1. sleeps 5 hours in the ext freeing deferred op before reserving log space. No it doesn't. 
The delay you added above occurs *after* the transaction has already guaranteed it has full log reservation for the extent freeing operation. You even use the transaction pointer to get the mount id (loop0) before stalling it.... > That simulates very slow extent free processing > 2. sleeps 5 hours before flushing metadata buffers. That simulate very slow > AIL processing. Well, it actually simulates *completely stalled* AIL processing. If it was simulating slow processing, then the tail of the log would always still be moving forwards slowly. Stalling metadata writeback for 5 hours is effectively hanging the journal for 5 hours, not making it "go slow". Those are very different situations that result in very different behaviours. > Though the hacking patch sleeps 5 hours, it actually needs only 10 minutes > to reproduce this issue. You should be able to fully stall the log and filesystem in under a second with this patch. > Reproduce steps: > > 1. create a XFS with 10MiB log size (small so easier to reproduce). The following > steps all aim at this XFS volume. Actually, make that a few milliseconds.... :) mkfs/xfs_info output would be appreciated. > 2. with un-hacked kernel, create 100K files, 1 block each file, under a dedicated > directory. We will overwrite those files causing inode timestamp logging. Ok, that's step 4, right? You're not doing that here? > A timestamp logging (tr_fsyncts) requires/resevers 2508 bytes, which is less > than the required size for 360416 (tr_itrunc) for 4.14.35 kernel. Not able to get > the exact number for 6.4.0, but should be similar. With those timestamp updating, > we are able to make the free log size less than 360416. > > 3. With hacked kernel, create a new file, allocating block(s) to it and remove it. > Because of the hacking patch, EFI would be logged but the EFD will not in 5 hours. > We will panic the kernel in 5 hours so that with next mount, EFI recover would > happen. 
Ok, so one EFI at the tail of the log, with a full reserve grant head and write grant head reservation for the extent free transaction that has been blocked. > 4. with another console, serially overwrite the 100K files we created in step 2. > This gets stuck after overwriting 57K+ files. That's because we used up (almost) > all the free log space and we pinned the log tail position with the hacking patch. > Checking the vmcore, it is stuck waiting for 2508 log bytes (1552 free bytes available). > The numbers are exactly the same with the 4.14.35 kernel and the 6.4.0 kernel. Yup, and the reserve grant head should also include at least one full tr_itrunc reservation for the EFI in progress. > > 5. Checking the on-disk free log space left, it's 181760 bytes for both the 4.14.35 > kernel and the 6.4.0 kernel. Which is clearly wrong. It should be at least 360416 bytes (i.e. tr_itrunc), because that's what the EFI being processed that pins the tail of the log is supposed to have reserved when it was stalled. So where has the ~180kB of leaked space come from? Have you traced the grant head reservations to find out what the runtime log space and grant head reservations actually are? i.e. we have full tracing of the log reservation accounting via tracepoints in the kernel. If there is a leak occurring, you need to capture a trace of all the reservation accounting operations and post-process the output to find out what operation is leaking reserved space. e.g. # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo ....
# trace-cmd report > s.t # head -3 s.t cpus=16 touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 # So this tells us the transaction reservation unit size, the count of reservations, the current reserve and grant head locations, and the current head and tail of the log at the time the transaction reservation is started and then after it completes. Hence we can tell if the reservation moved the grant head correctly, whether there was space available prior to the reservation being made, whether the grant heads contain in-memory reservations or whether they are empty (i.e. match the log tail), etc. All cases where we change the reservation and the grant heads are also captured by the above trace-cmd event filter, so a full picture of the grant head reservations vs log head/tail can be determined from post-processing the events.... This is how I've found log space leaks in the past. It's entirely possible that we are leaking a couple of bytes somewhere in the accounting, and that's the cause of the issue. It's happened before, and the accounting is complex enough that it will most likely happen again... > It's not the same as the free log space as seen in the vmcore, > 1552 bytes. I am not able to figure out why they are different yet. Does the delta aim > to make log recovery safe? Because different runtime conditions lead to different free log space states.
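Those two sample events already contain everything needed to sanity-check the grant-head move. Below is a minimal sketch of that check; the `fields()` helper and its regex are ours, keyed to the field names in the trace lines quoted above, and the exact event field layout varies between kernel versions:

```python
import re

# The two sample events quoted above, minus the task/timestamp prefix.
enter = ("xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 "
         "t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty "
         "writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 "
         "grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 "
         "curr_block 2 tail_cycle 1 tail_block 2")
exit_ = (enter.replace("xfs_log_reserve:", "xfs_log_reserve_exit:")
              .replace("grant_reserve_bytes 1024", "grant_reserve_bytes 482800")
              .replace("grant_write_bytes 1024", "grant_write_bytes 482800"))

def fields(line):
    """Pull the numeric key/value pairs out of one trace event line."""
    return {k: int(v) for k, v in re.findall(r"(\w+) (\d+)", line)}

e, x = fields(enter), fields(exit_)
# A permanent reservation should move the reserve grant head by exactly
# t_cnt * t_unit_res bytes between the enter and exit events.
moved = x["grant_reserve_bytes"] - e["grant_reserve_bytes"]
print(moved, moved == x["t_cnt"] * x["t_unit_res"])  # 481776 True (2 * 240888)
```

A leak hunt then amounts to pairing every reservation with the events that eventually give its space back, by ticket, and checking that bytes taken equal bytes returned.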
panic the kernel manually and reboot to un-hacked kernel. Verify the on disk > free log space, it's the same as seen in step 5. Right, so something has gone wrong with the runtime accounting, and log recovery is just the messenger. We need to find out what went wrong at runtime, not hack about in recovery trying to handle a situation that should not be happening in the first place... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-24 0:57 ` Dave Chinner @ 2023-07-24 18:03 ` Wengang Wang 2023-07-26 4:08 ` Dave Chinner 0 siblings, 1 reply; 22+ messages in thread From: Wengang Wang @ 2023-07-24 18:03 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S > On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: >> FYI: >> >> I am able reproduce the XFS mount hang issue with hacked kernels based on >> both 4.14.35 kernel or 6.4.0 kernel. >> >> The hacking patch is like this (based on 6.4.0 kernel): >> >> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c >> index f9e36b810663..e04e8738c564 100644 >> --- a/fs/xfs/xfs_extfree_item.c >> +++ b/fs/xfs/xfs_extfree_item.c >> @@ -491,6 +491,12 @@ xfs_extent_free_finish_item( >> struct xfs_extent_free_item *xefi; >> int error; >> >> + if (strcmp(tp->t_mountp->m_super->s_id,"loop0") == 0) { >> + pr_err("loop0 free extent enter sleeping 5 hours\n"); >> + msleep(3600*1000*5); >> + pr_err("loop0 free extent end sleeping 5 hours\n"); >> + } >> xefi = container_of(item, struct xfs_extent_free_item, xefi_list); >> >> error = xfs_trans_free_extent(tp, EFD_ITEM(done), xefi); >> diff --git a/fs/xfs/xfs_trans_ail.c b/fs/xfs/xfs_trans_ail.c >> index 7d4109af193e..b571fe6966af 100644 >> --- a/fs/xfs/xfs_trans_ail.c >> +++ b/fs/xfs/xfs_trans_ail.c >> @@ -557,6 +557,13 @@ xfsaild_push( >> xfs_trans_ail_cursor_done(&cur); >> spin_unlock(&ailp->ail_lock); >> >> + if (strcmp(ailp->ail_log->l_mp->m_super->s_id,"loop0") == 0) { >> + pr_err("loop0 aild enter sleeping 5 hours\n"); >> + msleep(3600*1000*5); // sleep 5 hours >> + pr_err("loop0 aild end sleeping 5 hours\n"); >> + pr_err("loop0 start issuing buffer IO\n"); >> + } >> + >> if (xfs_buf_delwri_submit_nowait(&ailp->ail_buf_list)) >> ailp->ail_log_flush++; >> >> The hacking patch does two things: >> 1. 
sleeps 5 hours in the ext freeing deferred op before reserving log space. > > No it doesn't. The delay you added above occurs *after* the > transaction has already guaranteed it has full log reservation for > the extent freeing operation. You even use the transaction pointer > to get the mount id (loop0) before stalling it.... Right, I wanted to do so but made the wrong patch. I should do it during the transaction roll, after committing the original transaction and before reserving new resources. > >> That simulates very slow extent free processing > >> 2. sleeps 5 hours before flushing metadata buffers. That simulates > very slow >> AIL processing. > > Well, it actually simulates *completely stalled* AIL processing. > If it was simulating slow processing, then the tail of the log would > always still be moving forwards slowly. Stalling metadata writeback > for 5 hours is effectively hanging the journal for 5 hours, not > making it "go slow". Those are very different situations that result > in very different behaviours. > >> Though the hacking patch sleeps 5 hours, it actually needs only 10 minutes >> to reproduce this issue. > > You should be able to fully stall the log and filesystem in under a > second with this patch. > The script allocates a transaction (reserving resources), commits it, and then starts another round. I didn't want pending transactions (with resources reserved but not yet committed) to exist when the free log bytes were exhausted. If there are enough pending transactions in memory, the on-disk free log bytes could be enough for EFI recovery. So the script didn't use up (almost) all the free log bytes that quickly. The script can be improved, but it already demonstrates the problem. >> Reproduce steps: >> >> 1. create an XFS with a 10MiB log size (small, so easier to reproduce). The following >> steps all aim at this XFS volume. > > Actually, make that a few milliseconds.... :) > :) > mkfs/xfs_info output would be appreciated.
sure,

# xfs_info 20GB.bk2
meta-data=20GB.bk2       isize=256    agcount=4, agsize=1310720 blks
         =               sectsz=512   attr=2, projid32bit=1
         =               crc=0        finobt=0, sparse=0, rmapbt=0
         =               reflink=0
data     =               bsize=4096   blocks=5242880, imaxpct=25
         =               sunit=0      swidth=0 blks
naming   =version 2      bsize=4096   ascii-ci=0, ftype=1
log      =internal log   bsize=4096   blocks=2560, version=2
         =               sectsz=512   sunit=0 blks, lazy-count=1
realtime =none           extsz=4096   blocks=0, rtextents=0

> >> 2. with an un-hacked kernel, create 100K files, 1 block each file, under a dedicated >> directory. > We will overwrite those files, causing inode timestamp logging. > > Ok, that's step 4, right? You're not doing that here? Yes, that's step 4. > >> A timestamp logging transaction (tr_fsyncts) requires/reserves 2508 bytes, which is less >> than the > required size of 360416 bytes (tr_itrunc) for the 4.14.35 kernel. Not able to get >> the exact number for 6.4.0, but it should be similar. With those timestamp updates, >> we are able to make the free log size less than > 360416. >> >> 3. With the hacked kernel, create a new file, allocate block(s) to it, and remove it. >> Because of the hacking patch, the EFI would be logged but the EFD will not be for 5 hours. >> We will panic the kernel within 5 hours so that on the next mount, EFI recovery would >> happen. > > Ok, so one EFI at the tail of the log, with a full reserve grant > head and write grant head reservation for the extent free > transaction that has been blocked. Yes, with the current patch (though I didn't want the reservation). > >> 4. with another console, serially overwrite the 100K files we created in step 2. >> This gets stuck after overwriting 57K+ files. That's because we used up (almost) >> all the free log space and we pinned the log tail position with the hacking patch. >> > Checking the vmcore, it is stuck waiting for 2508 log bytes (1552 free bytes available). >> The numbers are exactly the same with the 4.14.35 kernel and the 6.4.0 kernel.
> > Yup, and the reserve grant head should also include at least one > full tr_itrunc reservation for the EFI in progress. Yes (though I didn’t want that tr_itrunc reservation). > >> >> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 >> kernel and 6.4.0 kernel. > > Which is is clearly wrong. It should be at least 360416 bytes (i.e > tr_itrunc), because that's what the EFI being processed that pins > the tail of the log is supposed to have reserved when it was > stalled. Yep, exactly. > So where has the ~180kB of leaked space come from? > > Have you traced the grant head reservations to find out > what the runtime log space and grant head reservations actually are? I have the numbers in vmcore (ignore the WARNs), 74 l_reserve_head 75 head position: cycle: 8 bytes: 2978800 76 tail position: cycle: 7 bytes: 2980352 77 lastSync pos : cycle: 8 bytes: 2797568 78 available_size with head: 1552 bytes 79 1 log space reservers waiting 80 DEBUG: *(struct xlog_ticket *)0xffff9ba296dc68a0 = { 81 .t_queue = (struct list_head){ 82 .next = (struct list_head *)0xffff9ba2cfaabd88, 83 .prev = (struct list_head *)0xffff9ba2cfaabd88, 84 }, 85 .t_task = (struct task_struct *)0xffff9ba280fc4180, 86 .t_tid = (xlog_tid_t)1461406111, 87 .t_ref = (atomic_t){ 88 .counter = (int)1, 89 }, 90 .t_curr_res = (int)2508, 91 .t_unit_res = (int)2508, 92 .t_ocnt = (char)0, 93 .t_cnt = (char)0, 94 .t_flags = (uint16_t)0, 95 .t_iclog_hdrs = (int)1, 96 } 97 process 126306 needs: 2508 bytes 98 previous waiters need: 2508 bytes 99 WARN: Pending transactions not checked (to save time) 100 0 log bytes reserved for active transactions (this FS) 101 Reserved and to be reserved less than log size, seems xlog.l_tail_lsn not updated some how 102 l_write_head 103 head position: cycle: 8 bytes: 2978800 104 tail position: cycle: 7 bytes: 2980352 105 lastSync pos : cycle: 8 bytes: 2797568 106 available_size with head: 1552 bytes 107 0 log space reservers waiting 108 WARN: Pending 
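The 1552-byte figure in the dump falls straight out of the head/tail positions. The sketch below shows the arithmetic; it is a simplification of the kernel's cycle/block grant-head encoding, treating each position as a (cycle, bytes-into-cycle) pair as printed in the dump, with the log size taken from the xfs_info output earlier in the thread (2560 blocks x 4096 bytes):

```python
LOG_SIZE = 2560 * 4096  # 10 MiB internal log, per the xfs_info output above

def log_space_free(head, tail, size=LOG_SIZE):
    """Free log space given head/tail as (cycle, bytes-into-cycle) pairs.

    The log is a ring of `size` bytes and the head is at most one cycle
    ahead of the tail, so in absolute byte terms the free space is
    tail + size - head.
    """
    head_abs = head[0] * size + head[1]
    tail_abs = tail[0] * size + tail[1]
    return tail_abs + size - head_abs

# Positions from the vmcore dump above: head cycle 8 / 2978800 bytes,
# tail cycle 7 / 2980352 bytes.
print(log_space_free(head=(8, 2978800), tail=(7, 2980352)))  # 1552
```

This matches the "available_size with head: 1552 bytes" line in the dump, which is why the 2508-byte tr_fsyncts reservation cannot be granted.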
transactions not checked (to save time) 109 0 log bytes reserved for active transactions (this FS) > > i.e. we have full tracing of the log reservation accounting via > tracepoints in the kernel. If there is a leak occurring, you need to > capture a trace of all the reservation accounting operations and > post process the output to find out what operation is leaking > reserved space. e.g. > > # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo > .... > # trace-cmd report > s.t > # head -3 s.t > cpus=16 > touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > # > > So this tells us the transaction reservation unit size, the count of > reservations, the current reserve and grant head locations, and the > current head and tail of the log at the time the transaction > reservation is started and then after it completes. Will do that and report back. You want full log or only some typical ones? Full log would be big, how shall I share? > > Hence we can tell if the reservation moved the grant head correctly, > whether there was space available prior to the reservation being > made, whether the grant heads contain in-memory reservations or > whether they are empty (i.e. match the log tail), etc. 
> > All cases where we change the reservation and the grant heads are > also captured by the above trace-cmd event filter, so a full picture > of the grant head reservations vs log head/tail can be determined > from post-processing the events.... > > This is how I've found log space leaks in the past. It's entirely > possible that we are leaking a couple of bytes somewhere in the > accounting, and that's the cause of the issue. It's happened before, > and the accounting is complex enough it will most likely happen > again... > >> It’s not same as the free log space as seen in vmcore, >> 1552 bytes. > I am not able to figure oute why they are different yet. The delta aims >> to make log recover safe? > > Because different runtime conditions lead to different free log space > state. > >> 6. panic the kernel manually and reboot to un-hacked kernel. Verify the on disk >> free log space, it's the same as seen in step 5. > > Right, so something has gone wrong with the runtime accounting, and > log recovery is just the messenger. We need to find out what went > wrong at runtime, not hack about in recovery trying to handle a > situation that should not be happening in the first place... > Yes, agreed. (I was/am not chasing to fix it in recovery). thanks, wengang ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-24 18:03 ` Wengang Wang @ 2023-07-26 4:08 ` Dave Chinner 2023-07-26 15:23 ` Darrick J. Wong 2023-07-28 17:56 ` Wengang Wang 0 siblings, 2 replies; 22+ messages in thread From: Dave Chinner @ 2023-07-26 4:08 UTC (permalink / raw) To: Wengang Wang; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: > > On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: > >> FYI: > >> > >> I am able reproduce the XFS mount hang issue with hacked kernels based on > >> both 4.14.35 kernel or 6.4.0 kernel. > >> Reproduce steps: > >> > >> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following > >> steps all aim at this XFS volume. > > > > Actually, make that a few milliseconds.... :) > > :) > > > mkfs/xfs_info output would be appreciated. > > sure, > # xfs_info 20GB.bk2 > meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks > = sectsz=512 attr=2, projid32bit=1 > = crc=0 finobt=0, sparse=0, rmapbt=0 > = reflink=0 Hmmmm. Why are you only testing v4 filesystems? They are deprecated and support is largely due to be dropped from upstream in 2025... Does the same problem occur with a v5 filesystems? > >> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 > >> kernel and 6.4.0 kernel. > > > > Which is is clearly wrong. It should be at least 360416 bytes (i.e > > tr_itrunc), because that's what the EFI being processed that pins > > the tail of the log is supposed to have reserved when it was > > stalled. > > Yep, exactly. > > > So where has the ~180kB of leaked space come from? > > > > Have you traced the grant head reservations to find out > > what the runtime log space and grant head reservations actually are? > I have the numbers in vmcore (ignore the WARNs), That's not what I'm asking. 
You've dumped the values at the time of the hang, not traced the runtime reservations that have been made. > > i.e. we have full tracing of the log reservation accounting via > > tracepoints in the kernel. If there is a leak occurring, you need to > > capture a trace of all the reservation accounting operations and > > post process the output to find out what operation is leaking > > reserved space. e.g. > > > > # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo > > .... > > # trace-cmd report > s.t > > # head -3 s.t > > cpus=16 > > touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > > > # > > > > So this tells us the transaction reservation unit size, the count of > > reservations, the current reserve and grant head locations, and the > > current head and tail of the log at the time the transaction > > reservation is started and then after it completes. > > Will do that and report back. You want full log or only some typical > ones? Full log would be big, how shall I share? I don't want to see the log. It'll be huge - I regularly generate traces containing gigabytes of log accounting traces like this from a single workload. What I'm asking you to do is run the tracing and then post process the values from the trace to determine what operation is using more space than is being freed back to the log. I generally do this with grep, awk and sed. 
Some people use python or perl. But either way it's a *lot* of work - in the past I have spent _weeks_ on trace analysis to find a 4 byte leak in the log space accounting. Doing things like graphing the head, tail and grant spaces over time tends to show if this is a gradual leak versus a sudden step change. If it's a sudden step change, then you can isolate it in the trace and work out what happened. If it's a gradual change, then you need to start looking for accounting discrepancies... e.g. a transaction records 32 bytes used in the item, so it releases t_unit - 32 bytes at commit. However, the CIL may then only track 28 bytes of space for the item in the journal, and we leak 4 bytes of reservation on every one of those items committed. These sorts of leaks typically only add up to being something significant in situations where the log is flooded with tiny inode timestamp changes - 4 bytes per item doesn't really matter when you only have a few thousand items in the log, but when you have hundreds of thousands of tiny items in the log... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
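The gradual-versus-step distinction described above is easy to automate once a free-space series has been extracted from the trace. A toy sketch, using synthetic sample data in place of the per-event series a real trace would yield:

```python
def classify_drift(samples, step_threshold):
    """Given free-log-space samples over time, report the largest single
    drop, the net change, and whether the loss looks like a sudden step
    or a gradual leak."""
    deltas = [b - a for a, b in zip(samples, samples[1:])]
    worst = min(deltas)                 # most negative single-event change
    total = samples[-1] - samples[0]    # net drift over the whole series
    kind = "step" if worst <= -step_threshold else "gradual"
    return worst, total, kind

# Synthetic example: a slow 4-byte-per-item leak...
gradual = [100000 - 4 * i for i in range(200)]
print(classify_drift(gradual, step_threshold=1000))  # (-4, -796, 'gradual')

# ...versus one sudden 50000-byte step change.
step = [100000] * 100 + [50000] * 100
print(classify_drift(step, step_threshold=1000))     # (-50000, -50000, 'step')
```

A "step" result points at one specific event window to inspect in the trace; a "gradual" result points at per-item accounting discrepancies like the t_unit/CIL mismatch described above.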
* Re: Question: reserve log space at IO time for recover 2023-07-26 4:08 ` Dave Chinner @ 2023-07-26 15:23 ` Darrick J. Wong 2023-07-27 1:05 ` Dave Chinner 2023-07-28 17:56 ` Wengang Wang 1 sibling, 1 reply; 22+ messages in thread From: Darrick J. Wong @ 2023-07-26 15:23 UTC (permalink / raw) To: Dave Chinner; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Wed, Jul 26, 2023 at 02:08:05PM +1000, Dave Chinner wrote: > On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: > > > On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: > > > On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: > > >> FYI: > > >> > > >> I am able reproduce the XFS mount hang issue with hacked kernels based on > > >> both 4.14.35 kernel or 6.4.0 kernel. > > >> Reproduce steps: > > >> > > >> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following > > >> steps all aim at this XFS volume. > > > > > > Actually, make that a few milliseconds.... :) > > > > :) > > > > > mkfs/xfs_info output would be appreciated. > > > > sure, > > # xfs_info 20GB.bk2 > > meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks > > = sectsz=512 attr=2, projid32bit=1 > > = crc=0 finobt=0, sparse=0, rmapbt=0 > > = reflink=0 > > Hmmmm. Why are you only testing v4 filesystems? They are deprecated > and support is largely due to be dropped from upstream in 2025... > > Does the same problem occur with a v5 filesystems? > > > >> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 > > >> kernel and 6.4.0 kernel. > > > > > > Which is is clearly wrong. It should be at least 360416 bytes (i.e > > > tr_itrunc), because that's what the EFI being processed that pins > > > the tail of the log is supposed to have reserved when it was > > > stalled. > > > > Yep, exactly. > > > > > So where has the ~180kB of leaked space come from? 
> > > > > > Have you traced the grant head reservations to find out > > > what the runtime log space and grant head reservations actually are? > > I have the numbers in vmcore (ignore the WARNs), > > That's not what I'm asking. You've dumped the values at the time of > the hang, not traced the runtime reservations that have been made. > > > > i.e. we have full tracing of the log reservation accounting via > > > tracepoints in the kernel. If there is a leak occurring, you need to > > > capture a trace of all the reservation accounting operations and > > > post process the output to find out what operation is leaking > > > reserved space. e.g. > > > > > > # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo > > > .... > > > # trace-cmd report > s.t > > > # head -3 s.t > > > cpus=16 > > > touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > > touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > > > > > # > > > > > > So this tells us the transaction reservation unit size, the count of > > > reservations, the current reserve and grant head locations, and the > > > current head and tail of the log at the time the transaction > > > reservation is started and then after it completes. > > > > Will do that and report back. You want full log or only some typical > > ones? Full log would be big, how shall I share? > > I don't want to see the log. 
It'll be huge - I regularly generate > traces containing gigabytes of log accounting traces like this from > a single workload. > > What I'm asking you to do is run the tracing and then post process > the values from the trace to determine what operation is using more > space than is being freed back to the log. > > I generally do this with grep, awk and sed. some people use python > or perl. But either way it's a *lot* of work - in the past I have > spent _weeks_ on trace analysis to find a 4 byte leak in the log > space accounting. DOing things like graphing the head, tail and grant > spaces over time tend to show if this is a gradual leak versus a > sudden step change. If it's a sudden step change, then you can > isolate it in the trace and work out what happened. If it's a > gradual change, then you need to start looking for accounting > discrepancies... Any chance you'd be willing to share that pipeline? It'd be useful to stash that kind of debugging program in xfsprogs/tools to save time and eliminate an entire class of "Hey Dave, did I capture this correctly?" questions. (At least until someone changes the tracepoints :P) --D > e.g. a transaction records 32 bytes used in the item, so it releases > t_unit - 32 bytes at commit. However, the CIL may then only track 28 > bytes of space for the item in the journal and we leak 4 bytes of > reservation on every on of those items committed. > > These sorts of leaks typically only add up to being somethign > significant in situations where the log is flooded with tiny inode > timestamp changes - 4 bytes iper item doesn't really matter when you > only have a few thousand items in the log, but when you have > hundreds of thousands of tiny items in the log... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-07-26 15:23 ` Darrick J. Wong @ 2023-07-27 1:05 ` Dave Chinner 0 siblings, 0 replies; 22+ messages in thread From: Dave Chinner @ 2023-07-27 1:05 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Wed, Jul 26, 2023 at 08:23:20AM -0700, Darrick J. Wong wrote: > On Wed, Jul 26, 2023 at 02:08:05PM +1000, Dave Chinner wrote: > > On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: > > > > On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: > > > > On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: > > > > i.e. we have full tracing of the log reservation accounting via > > > > tracepoints in the kernel. If there is a leak occurring, you need to > > > > capture a trace of all the reservation accounting operations and > > > > post process the output to find out what operation is leaking > > > > reserved space. e.g. > > > > > > > > # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo > > > > .... 
> > > > # trace-cmd report > s.t > > > > # head -3 s.t > > > > cpus=16 > > > > touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > > > touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > > > > > > > > # > > > > > > > > So this tells us the transaction reservation unit size, the count of > > > > reservations, the current reserve and grant head locations, and the > > > > current head and tail of the log at the time the transaction > > > > reservation is started and then after it completes. > > > > > > Will do that and report back. You want full log or only some typical > > > ones? Full log would be big, how shall I share? > > > > I don't want to see the log. It'll be huge - I regularly generate > > traces containing gigabytes of log accounting traces like this from > > a single workload. > > > > What I'm asking you to do is run the tracing and then post process > > the values from the trace to determine what operation is using more > > space than is being freed back to the log. > > > > I generally do this with grep, awk and sed. some people use python > > or perl. But either way it's a *lot* of work - in the past I have > > spent _weeks_ on trace analysis to find a 4 byte leak in the log > > space accounting. DOing things like graphing the head, tail and grant > > spaces over time tend to show if this is a gradual leak versus a > > sudden step change. 
If it's a sudden step change, then you can > > isolate it in the trace and work out what happened. If it's a > > gradual change, then you need to start looking for accounting > > discrepancies... > > Any chance you'd be willing to share that pipeline? It's all just one-off scripts I tend to write from scratch each time. I'm always looking for different things, and I never know what I'm looking for ahead of time, so the parser is always different.... > It'd be useful to > stash that kind of debugging program in xfsprogs/tools to save time and > eliminate an entire class of "Hey Dave, did I capture this correctly?" > questions. > > (At least until someone changes the tracepoints :P) That happens frequently enough... :/ Regardless, it's the process of writing filter scripts that helps me find the problem I'm looking for, not the actual filter script itself. e.g. I've been working on a similar "EFI reservation deadlocks on log space during recovery" issue over the past couple of days from a customer metadump. The filesystem had a 2GB log, and it was full. Log recovery took several minutes to run, there were so many items to recover. The (incomplete) logprint of all the op headers was over a million lines (somewhere around 400k individual log items), and I know that logprint output skipped at least two checkpoints that had over 65,000 ophdrs in them each. So log recovery was recovering roughly half a million log items from the log, and only then it was hanging on the first EFI that needed to be recovered. The mount trace was somewhere around 20GB of event data at the point of the hang; the logprint output indicated about 1.3 million ophdrs in the log that were being processed by recovery. I worked that out to be around half a million individual log items in the log that needed recovery.
In setting up a regex to filter the trace down to something manageable, I noticed that the id of the EFI that hung recovery matched an EFI in the very first checkpoint being recovered from the log (from the logprint output). Logprint told me that this checkpoint was almost entirely EFI/EFDs from an unlinked inode being inactivated. The last two log items in that checkpoint were the inode item and a single EFI. A quick grep failed to find that inode anywhere else in the logprint, and the EFI id didn't show up in any EFD item, either. Pulling them out of the mount trace confirmed that they only appeared in recovery of the initial checkpoint. IOWs, the EFI - and the inode that was locked across the processing of the EFI at runtime - had pinned the tail of the log, and that's why recovery hung waiting on log space. It turns out that the hang in log recovery followed a system reboot due to a hung task timeout. So right now, it looks to me like the runtime code had deadlocked the journal because transaction reservation during a roll couldn't make progress, because the inode held locked across the transaction roll pinned the tail of the log.... How that happened, I have no idea, but it was the _process_ of writing filters to analyse the logprint data and log reservation traces that allowed me to find this, not because I had a set of canned scripts that exposed that explicit information. Even if I write a canned script to find unmatched EFIs in a logprint, the logprint output is sufficiently unreliable that I couldn't rely on the output of the filter script anyway..... Fundamentally, I don't have the time to run this trace+filter process for every report of similar problems - it takes a long time just to work out what I need to look for in the traces. Hence it's important that more people learn the process - how to read and filter the traces to isolate something that might be going wrong.
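The EFI/EFD pairing logic itself is simple set arithmetic; it is the unreliable logprint output feeding it that makes a canned version untrustworthy. A sketch of the idea, where the EFI/EFD line format is invented purely for illustration and does not match any particular xfs_logprint version:

```python
import re

# Hypothetical logprint-style lines: two EFIs, only one of which is
# completed by a matching EFD.
logprint = """\
EFI: cnt:1 total:1 c:1 id:0xffff8800deadbeef
EFD: cnt:1 id:0xffff8800deadbeef
EFI: cnt:1 total:1 c:1 id:0xffff8800cafef00d
INO: ...
"""

def unmatched_efis(text):
    """Return EFI ids that never see a matching EFD - candidates for the
    intent that pinned the tail of the log."""
    efis = set(re.findall(r"EFI:.*?id:(0x[0-9a-f]+)", text))
    efds = set(re.findall(r"EFD:.*?id:(0x[0-9a-f]+)", text))
    return efis - efds

print(unmatched_efis(logprint))  # {'0xffff8800cafef00d'}
```

The same pairing can be run over the mount-trace intent-recovery events instead of logprint output, which avoids the logprint reliability problem at the cost of needing a live reproduction.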
I don't have canned scripts because the scripts I write are guided by what I think I need to search for, and I regularly get to a dead end and have to start the filtering process all over again to look for something else entirely. Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
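[Editor's sketch] A minimal example of the kind of one-off filter Dave describes - collecting EFI ids that never show up in a matching EFD - might look like the following in python. The `id 0x...` line format here is an assumption for illustration only; real xfs_logprint output varies between versions and, as Dave notes above, can be unreliable, so this is a sketch of the process rather than a reusable tool:

```python
import re

# Assumed (hypothetical) item format: each EFI/EFD line carries "id 0x<hex>".
_EFI = re.compile(r"\bEFI\b.*?\bid\s+(0x[0-9a-fA-F]+)")
_EFD = re.compile(r"\bEFD\b.*?\bid\s+(0x[0-9a-fA-F]+)")

def unmatched_efis(logprint_lines):
    """Return the set of EFI ids that have no matching EFD anywhere."""
    efis, efds = set(), set()
    for line in logprint_lines:
        if (m := _EFI.search(line)):
            efis.add(m.group(1))
        if (m := _EFD.search(line)):
            efds.add(m.group(1))
    return efis - efds
```

An id left in the result set is a candidate for an intent pinning the tail of the log, which is exactly what the initial-checkpoint EFI above turned out to be.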
* Re: Question: reserve log space at IO time for recover 2023-07-26 4:08 ` Dave Chinner 2023-07-26 15:23 ` Darrick J. Wong @ 2023-07-28 17:56 ` Wengang Wang 2023-08-18 3:25 ` Wengang Wang 1 sibling, 1 reply; 22+ messages in thread From: Wengang Wang @ 2023-07-28 17:56 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S > On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: >>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: >>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: >>>> FYI: >>>> >>>> I am able reproduce the XFS mount hang issue with hacked kernels based on >>>> both 4.14.35 kernel or 6.4.0 kernel. >>>> Reproduce steps: >>>> >>>> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following >>>> steps all aim at this XFS volume. >>> >>> Actually, make that a few milliseconds.... :) >> >> :) >> >>> mkfs/xfs_info output would be appreciated. >> >> sure, >> # xfs_info 20GB.bk2 >> meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks >> = sectsz=512 attr=2, projid32bit=1 >> = crc=0 finobt=0, sparse=0, rmapbt=0 >> = reflink=0 > > Hmmmm. Why are you only testing v4 filesystems? They are deprecated > and support is largely due to be dropped from upstream in 2025... > Ha, it just happened to be so. I was trying to reproduce it in the same environment as customer has — that’s OracleLinux7. The default behavior of mkfs.xfs in OL7 is to format v4 filesystems. I created the xfs image in a file on OL7 and copied the image file to a 6.4.0 kernel machine. That’s why you see v4 filesystem here. > Does the same problem occur with a v5 filesystems? Will test and report back. > >>>> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 >>>> kernel and 6.4.0 kernel. >>> >>> Which is is clearly wrong. 
It should be at least 360416 bytes (i.e >>> tr_itrunc), because that's what the EFI being processed that pins >>> the tail of the log is supposed to have reserved when it was >>> stalled. >> >> Yep, exactly. >> >>> So where has the ~180kB of leaked space come from? >>> >>> Have you traced the grant head reservations to find out >>> what the runtime log space and grant head reservations actually are? >> I have the numbers in vmcore (ignore the WARNs), > > That's not what I'm asking. You've dumped the values at the time of > the hang, not traced the runtime reservations that have been made. > >>> i.e. we have full tracing of the log reservation accounting via >>> tracepoints in the kernel. If there is a leak occurring, you need to >>> capture a trace of all the reservation accounting operations and >>> post process the output to find out what operation is leaking >>> reserved space. e.g. >>> >>> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo >>> .... >>> # trace-cmd report > s.t >>> # head -3 s.t >>> cpus=16 >>> touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>> touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>> >>> # >>> >>> So this tells us the transaction reservation unit size, the count of >>> reservations, the current reserve and grant head locations, and the >>> current head and tail of the log at the time the transaction >>> reservation is started and then after it completes. 
>> >> Will do that and report back. You want full log or only some typical >> ones? Full log would be big, how shall I share? > > I don't want to see the log. It'll be huge - I regularly generate > traces containing gigabytes of log accounting traces like this from > a single workload. > > What I'm asking you to do is run the tracing and then post process > the values from the trace to determine what operation is using more > space than is being freed back to the log. > > I generally do this with grep, awk and sed. some people use python > or perl. But either way it's a *lot* of work - in the past I have > spent _weeks_ on trace analysis to find a 4 byte leak in the log > space accounting. DOing things like graphing the head, tail and grant > spaces over time tend to show if this is a gradual leak versus a > sudden step change. If it's a sudden step change, then you can > isolate it in the trace and work out what happened. If it's a > gradual change, then you need to start looking for accounting > discrepancies... > > e.g. a transaction records 32 bytes used in the item, so it releases > t_unit - 32 bytes at commit. However, the CIL may then only track 28 > bytes of space for the item in the journal and we leak 4 bytes of > reservation on every on of those items committed. > > These sorts of leaks typically only add up to being somethign > significant in situations where the log is flooded with tiny inode > timestamp changes - 4 bytes iper item doesn't really matter when you > only have a few thousand items in the log, but when you have > hundreds of thousands of tiny items in the log... > OK. will work more on this. # I am going to start a two-week vacation, and will then continue on this when back. thanks, wengang ^ permalink raw reply [flat|nested] 22+ messages in thread
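[Editor's sketch] The post-processing Dave suggests can be started in python against the xfs_log_reserve/xfs_log_reserve_exit line format shown above. This is a rough sketch: it matches "name value" fields naively and assumes both samples sit in the same log cycle (no grant-head wrap):

```python
import re

# Numeric "name value" pairs in an xfs_log_reserve / xfs_log_reserve_exit
# trace-cmd report line, e.g. "t_cnt 2 ... grant_reserve_bytes 1024 ...".
_FIELDS = re.compile(r"(\w+) (\d+)\b")

def parse_trace_line(line):
    """Extract the numeric fields of one trace line into a dict."""
    return {name: int(val) for name, val in _FIELDS.findall(line)}

def reservation_delta(enter_line, exit_line):
    """Bytes the reserve grant head moved across one xfs_log_reserve call.
    Naive: ignores cycle wraps between the two samples."""
    enter = parse_trace_line(enter_line)
    exit_ = parse_trace_line(exit_line)
    return exit_["grant_reserve_bytes"] - enter["grant_reserve_bytes"]
```

For the two sample lines quoted above the delta is 481776 bytes, i.e. t_unit_res * t_cnt (240888 * 2). Summing such deltas per operation over a full trace, and comparing them against what commit/ungrant events hand back, is the sort of script that shows whether reservation space leaks gradually or in a single step.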
* Re: Question: reserve log space at IO time for recover 2023-07-28 17:56 ` Wengang Wang @ 2023-08-18 3:25 ` Wengang Wang 2023-08-21 22:06 ` Wengang Wang 2023-08-24 4:52 ` Darrick J. Wong 0 siblings, 2 replies; 22+ messages in thread From: Wengang Wang @ 2023-08-18 3:25 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S Here is what things look to be: For file deletion, log bytes are reserved based on xfs_mount->tr_itruncate, which is: { tr_logres = 175488, tr_logcount = 2, tr_logflags = XFS_TRANS_PERM_LOG_RES, } You see it’s a permanent log reservation with two log operations (two transactions in rolling mode). After calculation (xlog_calc_unit_res(), adding space for various log headers), the final log space needed per transaction changes from 175488 to 180208 bytes. So the total log space needed is 360416 (180208 * 2). The above log space (360416 bytes) needs to be reserved for both run time inode removal (xfs_inactive_truncate()) and EFI recovery (xfs_efi_item_recover()). RUNTIME ======== For run time inode removal, the first transaction is mainly used for inode field changes. The second transaction is used for intents including extent freeing. The first transaction has 180208 reserved log bytes (another 180208 bytes are reserved for the coming transaction). The first transaction is committed, writing some bytes to the log and releasing the remaining reserved bytes. Now we have the second transaction, which has 180208 log bytes reserved too. The second transaction is supposed to process intents including extent freeing. With my hacking patch, I blocked the extent freeing for 5 hours. So in those 5 hours, 180208 (NOT 360416) log bytes are reserved. With my test case, other transactions (updating timestamps) then happen. As my hacking patch pins the journal tail, those timestamp-updating transactions finally use up (almost) all the remaining available log space (in memory and on disk). 
And finally the on disk (and in memory) available log space goes down to near 180208 bytes. Those 180208 bytes are reserved by the above second (extent-free) transaction. Panic the kernel and remount the xfs volume. LOG RECOVER ============= With log recovery, during EFI recovery, we use tr_itruncate again to reserve two transactions that need 360416 log bytes. Reserving 360416 bytes fails (blocks) because we now only have about 180208 available. THINKING ======== Actually, during the EFI recovery, we only need one transaction to free the extents, just like the 2nd transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have (a bit) more than 180208 available log bytes on disk, so the reservation succeeds and the recovery proceeds. That is to say: we can fix the log recovery part to fix the issue. We can introduce a new xfs_trans_res xfs_mount->tr_ext_free { tr_logres = 175488, tr_logcount = 0, tr_logflags = 0, } and use tr_ext_free instead of tr_itruncate in EFI recovery. (didn’t try it). thanks, wengang > On Jul 28, 2023, at 10:56 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > > > >> On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote: >> >> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: >>>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: >>>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: >>>>> FYI: >>>>> >>>>> I am able reproduce the XFS mount hang issue with hacked kernels based on >>>>> both 4.14.35 kernel or 6.4.0 kernel. >>>>> Reproduce steps: >>>>> >>>>> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following >>>>> steps all aim at this XFS volume. >>>> >>>> Actually, make that a few milliseconds.... :) >>> >>> :) >>> >>>> mkfs/xfs_info output would be appreciated. 
>>> >>> sure, >>> # xfs_info 20GB.bk2 >>> meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks >>> = sectsz=512 attr=2, projid32bit=1 >>> = crc=0 finobt=0, sparse=0, rmapbt=0 >>> = reflink=0 >> >> Hmmmm. Why are you only testing v4 filesystems? They are deprecated >> and support is largely due to be dropped from upstream in 2025... >> > > Ha, it just happened to be so. > I was trying to reproduce it in the same environment as customer has — > that’s OracleLinux7. The default behavior of mkfs.xfs in OL7 is to format > v4 filesystems. I created the xfs image in a file on OL7 and copied the image > file to a 6.4.0 kernel machine. That’s why you see v4 filesystem here. > >> Does the same problem occur with a v5 filesystems? > > Will test and report back. > >> >>>>> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 >>>>> kernel and 6.4.0 kernel. >>>> >>>> Which is is clearly wrong. It should be at least 360416 bytes (i.e >>>> tr_itrunc), because that's what the EFI being processed that pins >>>> the tail of the log is supposed to have reserved when it was >>>> stalled. >>> >>> Yep, exactly. >>> >>>> So where has the ~180kB of leaked space come from? >>>> >>>> Have you traced the grant head reservations to find out >>>> what the runtime log space and grant head reservations actually are? >>> I have the numbers in vmcore (ignore the WARNs), >> >> That's not what I'm asking. You've dumped the values at the time of >> the hang, not traced the runtime reservations that have been made. >> >>>> i.e. we have full tracing of the log reservation accounting via >>>> tracepoints in the kernel. If there is a leak occurring, you need to >>>> capture a trace of all the reservation accounting operations and >>>> post process the output to find out what operation is leaking >>>> reserved space. e.g. >>>> >>>> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo >>>> .... 
>>>> # trace-cmd report > s.t >>>> # head -3 s.t >>>> cpus=16 >>>> touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>>> touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>>> >>>> # >>>> >>>> So this tells us the transaction reservation unit size, the count of >>>> reservations, the current reserve and grant head locations, and the >>>> current head and tail of the log at the time the transaction >>>> reservation is started and then after it completes. >>> >>> Will do that and report back. You want full log or only some typical >>> ones? Full log would be big, how shall I share? >> >> I don't want to see the log. It'll be huge - I regularly generate >> traces containing gigabytes of log accounting traces like this from >> a single workload. >> >> What I'm asking you to do is run the tracing and then post process >> the values from the trace to determine what operation is using more >> space than is being freed back to the log. >> >> I generally do this with grep, awk and sed. some people use python >> or perl. But either way it's a *lot* of work - in the past I have >> spent _weeks_ on trace analysis to find a 4 byte leak in the log >> space accounting. DOing things like graphing the head, tail and grant >> spaces over time tend to show if this is a gradual leak versus a >> sudden step change. If it's a sudden step change, then you can >> isolate it in the trace and work out what happened. 
If it's a >> gradual change, then you need to start looking for accounting >> discrepancies... >> >> e.g. a transaction records 32 bytes used in the item, so it releases >> t_unit - 32 bytes at commit. However, the CIL may then only track 28 >> bytes of space for the item in the journal and we leak 4 bytes of >> reservation on every on of those items committed. >> >> These sorts of leaks typically only add up to being somethign >> significant in situations where the log is flooded with tiny inode >> timestamp changes - 4 bytes iper item doesn't really matter when you >> only have a few thousand items in the log, but when you have >> hundreds of thousands of tiny items in the log... >> > > OK. will work more on this. > # I am going to start a two-week vacation, and will then continue on this when back. > > thanks, > wengang ^ permalink raw reply [flat|nested] 22+ messages in thread
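[Editor's sketch] The arithmetic in the RUNTIME/LOG RECOVER analysis above can be sanity-checked with a few lines of python. The 180208-byte per-transaction unit size is taken from this thread's numbers, not recomputed from xlog_calc_unit_res(), and treating a logcount of 0 as a single unit mirrors the proposed non-permanent tr_ext_free reservation:

```python
# Numbers from this thread (v4 filesystem, 10MiB log). UNIT_RES is
# tr_logres (175488) plus the header overhead that xlog_calc_unit_res()
# adds; the overhead itself is not derived here.
TR_LOGRES = 175488        # tr_itruncate.tr_logres
UNIT_RES = 180208         # per-transaction unit after xlog_calc_unit_res()
TR_LOGCOUNT = 2           # tr_itruncate.tr_logcount (permanent, rolling)
FREE_LOG_SPACE = 181760   # on-disk free journal space at mount time

def grant_space_needed(unit_res: int, logcount: int) -> int:
    """Grant space a reservation must find up front: one unit per count,
    treating logcount 0 (a plain, non-rolling transaction) as one unit."""
    return unit_res * max(logcount, 1)

# Stock EFI recovery reserves both counts of tr_itruncate and hangs:
print(grant_space_needed(UNIT_RES, TR_LOGCOUNT))  # 360416 > 181760 free
# A single-transaction extent-free reservation would fit:
print(grant_space_needed(UNIT_RES, 1))            # 180208 <= 181760 free
```

The gap between the two is exactly why recovery blocks while runtime only ever held one unit across the stalled extent-free step.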
* Re: Question: reserve log space at IO time for recover 2023-08-18 3:25 ` Wengang Wang @ 2023-08-21 22:06 ` Wengang Wang 2023-08-24 5:05 ` Darrick J. Wong 2023-08-24 4:52 ` Darrick J. Wong 1 sibling, 1 reply; 22+ messages in thread From: Wengang Wang @ 2023-08-21 22:06 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S > On Aug 17, 2023, at 8:25 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > > What things looks to be: > > For the file deletion. log bytes are reserved basing on xfs_mount->tr_itruncate which is: > { > tr_logres = 175488, > tr_logcount = 2, > tr_logflags = XFS_TRANS_PERM_LOG_RES, > } > You see it’s permanent log reservation with two log operations (two transactions in rolling mode). > After calculation (xlog_calc_unit_res(), adding space for various log headers), the final > log space needed per transaction changes from 175488 to 180208 bytes. So the total > log space needed is 360416 (180208 * 2). > Above log space (360416 bytes) needs to be reserved for both run time inode removing > (xfs_inactive_truncate()) and EFI recover (xfs_efi_item_recover()). > > RUNTIME > ======== > > For run time inode removing. The first transaction is mainly used for inode fields change. > The second transaction is used for intents including extent freeing. > > For the first transaction, it has 180208 reserved log bytes (another 180208 bytes reserved > for the coming transaction). > The first transaction is committed, writing some bytes to log and releasing the left reserved bytes. > > Now we have the second transaction which has 180208 log bytes reserved too. The second > transaction is supposed to process intents including extent freeing. With my hacking patch, > I blocked the extent freeing 5 hours. So in that 5 hours, 180208 (NOT 360416) log bytes are reserved. > > With my test case, other transactions (update timestamps) then happen. 
As my hacking patch > pins the journal tail, those timestamp-updating transactions finally use up (almost) all the left available > log space (in memory in on disk). And finally the on disk (and in memory) available log space > goes down near to 180208 bytes. Those 180208 bytes are reserved by above second (extent-free) > transaction. > > Panic the kernel and remount the xfs volume > > LOG RECOVER > ============= > > With log recover, during EFI recover, we use tr_itruncate again to reserve two transactions that needs > 360416 log bytes. Reserving 360416 bytes fails (blocks) because we now only have about 180208 available. > > THINKING > ======== > Actually during the EFI recover, we only need one transaction to free the extents just like the 2nd > transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have > (a bit) more than 180208 available log bytes on disk, so the reservation goes and the recovery goes. > That is to say: we can fix the log recover part to fix the issue. We can introduce a new xfs_trans_res > xfs_mount->tr_ext_free > { > tr_logres = 175488, > tr_logcount = 0, > tr_logflags = 0, > } > and use tr_ext_free instead of tr_itruncate in EFI recover. (didn’t try it). > The following patch recovers the problematic XFS volume with my hacked kernel and also with the one from the customer. commit 19fad903e213717a92f8b94fe2c0c68b6a6ee7f7 (HEAD -> 35587163_fix) Author: Wengang Wang <wen.gang.wang@oracle.com> Date: Mon Aug 21 15:03:58 2023 -0700 xfs: reserve less log space when recovering EFIs Currently tr_itruncate is used for both run time truncating and boot time EFI recovery. tr_itruncate { tr_logres = 175488, tr_logcount = 2, tr_logflags = XFS_TRANS_PERM_LOG_RES, } is a permanent two-transaction series. Actually only the second transaction is really used to free extents, and that needs half of the log space reservation from tr_itruncate. 
For EFI recovery, the only thing to do is to free extents, so it doesn't need the full log space reservation of tr_itruncate. It needs half of it and shouldn't need more than that. Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c index f1a5ecf099aa..428984e48d23 100644 --- a/fs/xfs/xfs_extfree_item.c +++ b/fs/xfs/xfs_extfree_item.c @@ -667,6 +667,7 @@ xfs_efi_item_recover( int i; int error = 0; bool requeue_only = false; + struct xfs_trans_res tres; /* * First check the validity of the extents described by the @@ -683,7 +684,10 @@ xfs_efi_item_recover( } } - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); + tres.tr_logres = M_RES(mp)->tr_itruncate.tr_logres; + tres.tr_logcount = 0; + tres.tr_logflags = 0; + error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp); if (error) return error; efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents); thanks, wengang > thanks, > wengang > >> On Jul 28, 2023, at 10:56 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: >> >> >> >>> On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote: >>> >>> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: >>>>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: >>>>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: >>>>>> FYI: >>>>>> >>>>>> I am able reproduce the XFS mount hang issue with hacked kernels based on >>>>>> both 4.14.35 kernel or 6.4.0 kernel. >>>>>> Reproduce steps: >>>>>> >>>>>> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following >>>>>> steps all aim at this XFS volume. >>>>> >>>>> Actually, make that a few milliseconds.... :) >>>> >>>> :) >>>> >>>>> mkfs/xfs_info output would be appreciated. 
>>>> >>>> sure, >>>> # xfs_info 20GB.bk2 >>>> meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks >>>> = sectsz=512 attr=2, projid32bit=1 >>>> = crc=0 finobt=0, sparse=0, rmapbt=0 >>>> = reflink=0 >>> >>> Hmmmm. Why are you only testing v4 filesystems? They are deprecated >>> and support is largely due to be dropped from upstream in 2025... >>> >> >> Ha, it just happened to be so. >> I was trying to reproduce it in the same environment as customer has — >> that’s OracleLinux7. The default behavior of mkfs.xfs in OL7 is to format >> v4 filesystems. I created the xfs image in a file on OL7 and copied the image >> file to a 6.4.0 kernel machine. That’s why you see v4 filesystem here. >> >>> Does the same problem occur with a v5 filesystems? >> >> Will test and report back. >> >>> >>>>>> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 >>>>>> kernel and 6.4.0 kernel. >>>>> >>>>> Which is is clearly wrong. It should be at least 360416 bytes (i.e >>>>> tr_itrunc), because that's what the EFI being processed that pins >>>>> the tail of the log is supposed to have reserved when it was >>>>> stalled. >>>> >>>> Yep, exactly. >>>> >>>>> So where has the ~180kB of leaked space come from? >>>>> >>>>> Have you traced the grant head reservations to find out >>>>> what the runtime log space and grant head reservations actually are? >>>> I have the numbers in vmcore (ignore the WARNs), >>> >>> That's not what I'm asking. You've dumped the values at the time of >>> the hang, not traced the runtime reservations that have been made. >>> >>>>> i.e. we have full tracing of the log reservation accounting via >>>>> tracepoints in the kernel. If there is a leak occurring, you need to >>>>> capture a trace of all the reservation accounting operations and >>>>> post process the output to find out what operation is leaking >>>>> reserved space. e.g. >>>>> >>>>> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo >>>>> .... 
>>>>> # trace-cmd report > s.t >>>>> # head -3 s.t >>>>> cpus=16 >>>>> touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>>>> touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>>>> >>>>> # >>>>> >>>>> So this tells us the transaction reservation unit size, the count of >>>>> reservations, the current reserve and grant head locations, and the >>>>> current head and tail of the log at the time the transaction >>>>> reservation is started and then after it completes. >>>> >>>> Will do that and report back. You want full log or only some typical >>>> ones? Full log would be big, how shall I share? >>> >>> I don't want to see the log. It'll be huge - I regularly generate >>> traces containing gigabytes of log accounting traces like this from >>> a single workload. >>> >>> What I'm asking you to do is run the tracing and then post process >>> the values from the trace to determine what operation is using more >>> space than is being freed back to the log. >>> >>> I generally do this with grep, awk and sed. some people use python >>> or perl. But either way it's a *lot* of work - in the past I have >>> spent _weeks_ on trace analysis to find a 4 byte leak in the log >>> space accounting. DOing things like graphing the head, tail and grant >>> spaces over time tend to show if this is a gradual leak versus a >>> sudden step change. If it's a sudden step change, then you can >>> isolate it in the trace and work out what happened. 
If it's a >>> gradual change, then you need to start looking for accounting >>> discrepancies... >>> >>> e.g. a transaction records 32 bytes used in the item, so it releases >>> t_unit - 32 bytes at commit. However, the CIL may then only track 28 >>> bytes of space for the item in the journal and we leak 4 bytes of >>> reservation on every on of those items committed. >>> >>> These sorts of leaks typically only add up to being somethign >>> significant in situations where the log is flooded with tiny inode >>> timestamp changes - 4 bytes iper item doesn't really matter when you >>> only have a few thousand items in the log, but when you have >>> hundreds of thousands of tiny items in the log... >>> >> >> OK. will work more on this. >> # I am going to start a two-week vacation, and will then continue on this when back. >> >> thanks, >> wengang > > ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-08-21 22:06 ` Wengang Wang @ 2023-08-24 5:05 ` Darrick J. Wong 2023-08-24 22:55 ` Wengang Wang 0 siblings, 1 reply; 22+ messages in thread From: Darrick J. Wong @ 2023-08-24 5:05 UTC (permalink / raw) To: Wengang Wang; +Cc: Dave Chinner, linux-xfs@vger.kernel.org, Srikanth C S On Mon, Aug 21, 2023 at 10:06:45PM +0000, Wengang Wang wrote: > > > > On Aug 17, 2023, at 8:25 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > > > > What things looks to be: > > > > For the file deletion. log bytes are reserved basing on xfs_mount->tr_itruncate which is: > > { > > tr_logres = 175488, > > tr_logcount = 2, > > tr_logflags = XFS_TRANS_PERM_LOG_RES, > > } > > You see it’s permanent log reservation with two log operations (two transactions in rolling mode). > > After calculation (xlog_calc_unit_res(), adding space for various log headers), the final > > log space needed per transaction changes from 175488 to 180208 bytes. So the total > > log space needed is 360416 (180208 * 2). > > Above log space (360416 bytes) needs to be reserved for both run time inode removing > > (xfs_inactive_truncate()) and EFI recover (xfs_efi_item_recover()). > > > > RUNTIME > > ======== > > > > For run time inode removing. The first transaction is mainly used for inode fields change. > > The second transaction is used for intents including extent freeing. > > > > For the first transaction, it has 180208 reserved log bytes (another 180208 bytes reserved > > for the coming transaction). > > The first transaction is committed, writing some bytes to log and releasing the left reserved bytes. > > > > Now we have the second transaction which has 180208 log bytes reserved too. The second > > transaction is supposed to process intents including extent freeing. With my hacking patch, > > I blocked the extent freeing 5 hours. So in that 5 hours, 180208 (NOT 360416) log bytes are reserved. 
> > > > With my test case, other transactions (update timestamps) then happen. As my hacking patch > > pins the journal tail, those timestamp-updating transactions finally use up (almost) all the left available > > log space (in memory in on disk). And finally the on disk (and in memory) available log space > > goes down near to 180208 bytes. Those 180208 bytes are reserved by above second (extent-free) > > transaction. > > > > Panic the kernel and remount the xfs volume > > > > LOG RECOVER > > ============= > > > > With log recover, during EFI recover, we use tr_itruncate again to reserve two transactions that needs > > 360416 log bytes. Reserving 360416 bytes fails (blocks) because we now only have about 180208 available. > > > > THINKING > > ======== > > Actually during the EFI recover, we only need one transaction to free the extents just like the 2nd > > transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have > > (a bit) more than 180208 available log bytes on disk, so the reservation goes and the recovery goes. > > That is to say: we can fix the log recover part to fix the issue. We can introduce a new xfs_trans_res > > xfs_mount->tr_ext_free > > { > > tr_logres = 175488, > > tr_logcount = 0, > > tr_logflags = 0, > > } > > and use tr_ext_free instead of tr_itruncate in EFI recover. (didn’t try it). > > > > The following patch recovers the problematic XFS volume by my hacked kernel and the also > the one from customer. > > commit 19fad903e213717a92f8b94fe2c0c68b6a6ee7f7 (HEAD -> 35587163_fix) > Author: Wengang Wang <wen.gang.wang@oracle.com> > Date: Mon Aug 21 15:03:58 2023 -0700 > > xfs: reserve less log space when recovering EFIs > > Currently tr_itruncate is used for both run time truncating and > boot time EFI recovery. tr_itruncate > { > tr_logres = 175488, > tr_logcount = 2, > tr_logflags = XFS_TRANS_PERM_LOG_RES, > } > > Is a permanent two-transaction series. 
Actually only the second transaction > is really used to free extents and that needs half of the log space reservation > from tr_itruncate. > > For EFI recovery, the things to do is just to free extents, so it doesn't > needs full log space reservation by tr_itruncate. It needs half of it and > shouldn't need more than that. > > Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> > > diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c > index f1a5ecf099aa..428984e48d23 100644 > --- a/fs/xfs/xfs_extfree_item.c > +++ b/fs/xfs/xfs_extfree_item.c u> @@ -667,6 +667,7 @@ xfs_efi_item_recover( > int i; > int error = 0; > bool requeue_only = false; > + struct xfs_trans_res tres; > > /* > * First check the validity of the extents described by the > @@ -683,7 +684,10 @@ xfs_efi_item_recover( > } > } > > - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); > + tres.tr_logres = M_RES(mp)->tr_itruncate.tr_logres; > + tres.tr_logcount = 0; HAH that was fast. I'm glad it worked. > + tres.tr_logflags = 0; Though I think we should preserve XFS_TRANS_PERM_LOG_RES since other parts of the codebase check that the flag was conveyed from the transaction reservation into tp->t_flags itself. I think for an upstream patch we'd rather fix all of them, and in a systemic way. How about adding this to xfs_log_recover.h: /* * Transform a regular reservation into one suitable for recovery of a * log intent item. * * Intent recovery only runs a single step of the transaction chain and * defers the rest to a separate transaction. Therefore, we reduce * logcount to 1 here to avoid livelocks if the log grant space is * nearly exhausted due to the recovered intent pinning the tail. Keep * the same logflags to avoid tripping asserts elsewhere. Struct copies * abound below. 
*/ static inline struct xfs_trans_res xlog_recover_resv(const struct xfs_trans_res *template) { struct xfs_trans_res ret = { .tr_logres = template->tr_logres, .tr_logcount = 1, .tr_logflags = template->tr_logflags, }; return ret; } and then this becomes: struct xfs_trans_res resv; resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate); error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); which should simplify fixing this for the other recovery functions? --D > + error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp); > if (error) > return error; > efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents); > > thanks, > wengang > > > thanks, > > wengang > > > >> On Jul 28, 2023, at 10:56 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > >> > >> > >> > >>> On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote: > >>> > >>> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: > >>>>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: > >>>>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: > >>>>>> FYI: > >>>>>> > >>>>>> I am able reproduce the XFS mount hang issue with hacked kernels based on > >>>>>> both 4.14.35 kernel or 6.4.0 kernel. > >>>>>> Reproduce steps: > >>>>>> > >>>>>> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following > >>>>>> steps all aim at this XFS volume. > >>>>> > >>>>> Actually, make that a few milliseconds.... :) > >>>> > >>>> :) > >>>> > >>>>> mkfs/xfs_info output would be appreciated. > >>>> > >>>> sure, > >>>> # xfs_info 20GB.bk2 > >>>> meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks > >>>> = sectsz=512 attr=2, projid32bit=1 > >>>> = crc=0 finobt=0, sparse=0, rmapbt=0 > >>>> = reflink=0 > >>> > >>> Hmmmm. Why are you only testing v4 filesystems? They are deprecated > >>> and support is largely due to be dropped from upstream in 2025... > >>> > >> > >> Ha, it just happened to be so. 
> >> I was trying to reproduce it in the same environment as the customer has — > >> that’s OracleLinux7. The default behavior of mkfs.xfs in OL7 is to format > >> v4 filesystems. I created the xfs image in a file on OL7 and copied the image > >> file to a 6.4.0 kernel machine. That’s why you see v4 filesystem here. > >> > >>> Does the same problem occur with a v5 filesystem? > >> > >> Will test and report back. > >> > >>> > >>>>>> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 > >>>>>> kernel and 6.4.0 kernel. > >>>>> > >>>>> Which is clearly wrong. It should be at least 360416 bytes (i.e > >>>>> tr_itrunc), because that's what the EFI being processed that pins > >>>>> the tail of the log is supposed to have reserved when it was > >>>>> stalled. > >>>> > >>>> Yep, exactly. > >>>> > >>>>> So where has the ~180kB of leaked space come from? > >>>>> > >>>>> Have you traced the grant head reservations to find out > >>>>> what the runtime log space and grant head reservations actually are? > >>>> I have the numbers in vmcore (ignore the WARNs), > >>> > >>> That's not what I'm asking. You've dumped the values at the time of > >>> the hang, not traced the runtime reservations that have been made. > >>> > >>>>> i.e. we have full tracing of the log reservation accounting via > >>>>> tracepoints in the kernel. If there is a leak occurring, you need to > >>>>> capture a trace of all the reservation accounting operations and > >>>>> post process the output to find out what operation is leaking > >>>>> reserved space. e.g.
> >>>>> # trace-cmd report > s.t > >>>>> # head -3 s.t > >>>>> cpus=16 > >>>>> touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > >>>>> touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > >>>>> > >>>>> # > >>>>> > >>>>> So this tells us the transaction reservation unit size, the count of > >>>>> reservations, the current reserve and grant head locations, and the > >>>>> current head and tail of the log at the time the transaction > >>>>> reservation is started and then after it completes. > >>>> > >>>> Will do that and report back. You want full log or only some typical > >>>> ones? Full log would be big, how shall I share? > >>> > >>> I don't want to see the log. It'll be huge - I regularly generate > >>> traces containing gigabytes of log accounting traces like this from > >>> a single workload. > >>> > >>> What I'm asking you to do is run the tracing and then post process > >>> the values from the trace to determine what operation is using more > >>> space than is being freed back to the log. > >>> > >>> I generally do this with grep, awk and sed. some people use python > >>> or perl. But either way it's a *lot* of work - in the past I have > >>> spent _weeks_ on trace analysis to find a 4 byte leak in the log > >>> space accounting. DOing things like graphing the head, tail and grant > >>> spaces over time tend to show if this is a gradual leak versus a > >>> sudden step change. 
If it's a sudden step change, then you can > >>> isolate it in the trace and work out what happened. If it's a > >>> gradual change, then you need to start looking for accounting > >>> discrepancies... > >>> > >>> e.g. a transaction records 32 bytes used in the item, so it releases > >>> t_unit - 32 bytes at commit. However, the CIL may then only track 28 > >>> bytes of space for the item in the journal and we leak 4 bytes of > >>> reservation on every one of those items committed. > >>> > >>> These sorts of leaks typically only add up to being something > >>> significant in situations where the log is flooded with tiny inode > >>> timestamp changes - 4 bytes per item doesn't really matter when you > >>> only have a few thousand items in the log, but when you have > >>> hundreds of thousands of tiny items in the log... > >>> > >> > >> OK. will work more on this. > >> # I am going to start a two-week vacation, and will then continue on this when back. > >> > >> thanks, > >> wengang
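Darrick's proposed xlog_recover_resv() helper is small enough to model outside the kernel. The sketch below is a hypothetical userspace mirror of it: the struct layout and the flag value are stand-ins for the real XFS headers, not the actual kernel definitions. It illustrates the intended behavior, namely that tr_logres and tr_logflags pass through unchanged while tr_logcount is forced to 1, so intent recovery only reserves grant space for the single step it actually runs.

```c
#include <assert.h>

/* Stand-ins for the kernel types and flag (hypothetical values). */
struct xfs_trans_res {
	unsigned int	tr_logres;	/* log space unit per transaction */
	int		tr_logcount;	/* number of log operations */
	int		tr_logflags;	/* XFS_TRANS_PERM_LOG_RES or 0 */
};
#define XFS_TRANS_PERM_LOG_RES	0x1

/*
 * Mirror of the helper proposed for xfs_log_recover.h: keep logres and
 * logflags from the template, but drop the reservation to a single log
 * count, since intent recovery performs only one step of the chain and
 * defers the rest to a separate transaction.
 */
static inline struct xfs_trans_res
xlog_recover_resv(const struct xfs_trans_res *template)
{
	struct xfs_trans_res ret = {
		.tr_logres	= template->tr_logres,
		.tr_logcount	= 1,
		.tr_logflags	= template->tr_logflags,
	};
	return ret;
}
```

With tr_itruncate's values from the thread (tr_logres = 175488, tr_logcount = 2, permanent reservation), the copy keeps the 175488-byte unit and the permanent flag but asks for one unit of grant space instead of two.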
* Re: Question: reserve log space at IO time for recover 2023-08-24 5:05 ` Darrick J. Wong @ 2023-08-24 22:55 ` Wengang Wang 0 siblings, 0 replies; 22+ messages in thread From: Wengang Wang @ 2023-08-24 22:55 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Dave Chinner, linux-xfs@vger.kernel.org, Srikanth C S > On Aug 23, 2023, at 10:05 PM, Darrick J. Wong <djwong@kernel.org> wrote: > > On Mon, Aug 21, 2023 at 10:06:45PM +0000, Wengang Wang wrote: >> >> >>> On Aug 17, 2023, at 8:25 PM, Wengang Wang <wen.gang.wang@oracle.com> wrote: >>> >>> What things looks to be: >>> >>> For the file deletion. log bytes are reserved basing on xfs_mount->tr_itruncate which is: >>> { >>> tr_logres = 175488, >>> tr_logcount = 2, >>> tr_logflags = XFS_TRANS_PERM_LOG_RES, >>> } >>> You see it’s permanent log reservation with two log operations (two transactions in rolling mode). >>> After calculation (xlog_calc_unit_res(), adding space for various log headers), the final >>> log space needed per transaction changes from 175488 to 180208 bytes. So the total >>> log space needed is 360416 (180208 * 2). >>> Above log space (360416 bytes) needs to be reserved for both run time inode removing >>> (xfs_inactive_truncate()) and EFI recover (xfs_efi_item_recover()). >>> >>> RUNTIME >>> ======== >>> >>> For run time inode removing. The first transaction is mainly used for inode fields change. >>> The second transaction is used for intents including extent freeing. >>> >>> For the first transaction, it has 180208 reserved log bytes (another 180208 bytes reserved >>> for the coming transaction). >>> The first transaction is committed, writing some bytes to log and releasing the left reserved bytes. >>> >>> Now we have the second transaction which has 180208 log bytes reserved too. The second >>> transaction is supposed to process intents including extent freeing. With my hacking patch, >>> I blocked the extent freeing 5 hours. So in that 5 hours, 180208 (NOT 360416) log bytes are reserved. 
>>> >>> With my test case, other transactions (update timestamps) then happen. As my hacking patch >>> pins the journal tail, those timestamp-updating transactions finally use up (almost) all the left available >>> log space (in memory in on disk). And finally the on disk (and in memory) available log space >>> goes down near to 180208 bytes. Those 180208 bytes are reserved by above second (extent-free) >>> transaction. >>> >>> Panic the kernel and remount the xfs volume >>> >>> LOG RECOVER >>> ============= >>> >>> With log recover, during EFI recover, we use tr_itruncate again to reserve two transactions that needs >>> 360416 log bytes. Reserving 360416 bytes fails (blocks) because we now only have about 180208 available. >>> >>> THINKING >>> ======== >>> Actually during the EFI recover, we only need one transaction to free the extents just like the 2nd >>> transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have >>> (a bit) more than 180208 available log bytes on disk, so the reservation goes and the recovery goes. >>> That is to say: we can fix the log recover part to fix the issue. We can introduce a new xfs_trans_res >>> xfs_mount->tr_ext_free >>> { >>> tr_logres = 175488, >>> tr_logcount = 0, >>> tr_logflags = 0, >>> } >>> and use tr_ext_free instead of tr_itruncate in EFI recover. (didn’t try it). >>> >> >> The following patch recovers the problematic XFS volume by my hacked kernel and the also >> the one from customer. >> >> commit 19fad903e213717a92f8b94fe2c0c68b6a6ee7f7 (HEAD -> 35587163_fix) >> Author: Wengang Wang <wen.gang.wang@oracle.com> >> Date: Mon Aug 21 15:03:58 2023 -0700 >> >> xfs: reserve less log space when recovering EFIs >> >> Currently tr_itruncate is used for both run time truncating and >> boot time EFI recovery. tr_itruncate >> { >> tr_logres = 175488, >> tr_logcount = 2, >> tr_logflags = XFS_TRANS_PERM_LOG_RES, >> } >> >> Is a permanent two-transaction series. 
Actually only the second transaction >> is really used to free extents and that needs half of the log space reservation >> from tr_itruncate. >> >> For EFI recovery, the only thing to do is to free extents, so it doesn't >> need the full log space reservation of tr_itruncate. It needs half of it and >> shouldn't need more than that. >> >> Signed-off-by: Wengang Wang <wen.gang.wang@oracle.com> >> >> diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c >> index f1a5ecf099aa..428984e48d23 100644 >> --- a/fs/xfs/xfs_extfree_item.c >> +++ b/fs/xfs/xfs_extfree_item.c >> @@ -667,6 +667,7 @@ xfs_efi_item_recover( >> int i; >> int error = 0; >> bool requeue_only = false; >> + struct xfs_trans_res tres; >> >> /* >> * First check the validity of the extents described by the >> @@ -683,7 +684,10 @@ xfs_efi_item_recover( >> } >> } >> >> - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp); >> + tres.tr_logres = M_RES(mp)->tr_itruncate.tr_logres; >> + tres.tr_logcount = 0; > > HAH that was fast. I'm glad it worked. > >> + tres.tr_logflags = 0; > > Though I think we should preserve XFS_TRANS_PERM_LOG_RES since other > parts of the codebase check that the flag was conveyed from the > transaction reservation into tp->t_flags itself. > Sure, I could do that. I am just wondering about the meaning of XFS_TRANS_PERM_LOG_RES. I am thinking it means that the to-be-reserved log bytes would be used for multiple transactions in a chain. And the tr_logcount indicates the number of the transactions in the chain. So my understanding is that a single transaction is not a chain (surely, you can understand it as a one-transaction chain). So XFS_TRANS_PERM_LOG_RES is not needed. Is that a correct understanding? > I think for an upstream patch we'd rather fix all of them, and in a > systemic way. How about adding this to xfs_log_recover.h: > > /* > * Transform a regular reservation into one suitable for recovery of a > * log intent item.
> * > * Intent recovery only runs a single step of the transaction chain and > * defers the rest to a separate transaction. Therefore, we reduce > * logcount to 1 here to avoid livelocks if the log grant space is > * nearly exhausted due to the recovered intent pinning the tail. Keep > * the same logflags to avoid tripping asserts elsewhere. Struct copies > * abound below. > */ > static inline struct xfs_trans_res > xlog_recover_resv(const struct xfs_trans_res *template) > { > struct xfs_trans_res ret = { > .tr_logres = template->tr_logres, > .tr_logcount = 1, > .tr_logflags = template->tr_logflags, > }; > > return ret; > } > > and then this becomes: > > struct xfs_trans_res resv; > > resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate); > error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); > > which should simplify fixing this for the other recovery functions? > Yes, I could do that. thanks, wengang > --D > >> + error = xfs_trans_alloc(mp, &tres, 0, 0, 0, &tp); > >> if (error) >> return error; >> efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents); >> >> thanks, >> wengang >> >>> thanks, >>> wengang >>> >>>> On Jul 28, 2023, at 10:56 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: >>>> >>>> >>>> >>>>> On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote: >>>>> >>>>> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: >>>>>>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: >>>>>>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: >>>>>>>> FYI: >>>>>>>> >>>>>>>> I am able reproduce the XFS mount hang issue with hacked kernels based on >>>>>>>> both 4.14.35 kernel or 6.4.0 kernel. >>>>>>>> Reproduce steps: >>>>>>>> >>>>>>>> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following >>>>>>>> steps all aim at this XFS volume. >>>>>>> >>>>>>> Actually, make that a few milliseconds.... :) >>>>>> >>>>>> :) >>>>>> >>>>>>> mkfs/xfs_info output would be appreciated. 
>>>>>> >>>>>> sure, >>>>>> # xfs_info 20GB.bk2 >>>>>> meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks >>>>>> = sectsz=512 attr=2, projid32bit=1 >>>>>> = crc=0 finobt=0, sparse=0, rmapbt=0 >>>>>> = reflink=0 >>>>> >>>>> Hmmmm. Why are you only testing v4 filesystems? They are deprecated >>>>> and support is largely due to be dropped from upstream in 2025... >>>>> >>>> >>>> Ha, it just happened to be so. >>>> I was trying to reproduce it in the same environment as customer has — >>>> that’s OracleLinux7. The default behavior of mkfs.xfs in OL7 is to format >>>> v4 filesystems. I created the xfs image in a file on OL7 and copied the image >>>> file to a 6.4.0 kernel machine. That’s why you see v4 filesystem here. >>>> >>>>> Does the same problem occur with a v5 filesystems? >>>> >>>> Will test and report back. >>>> >>>>> >>>>>>>> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 >>>>>>>> kernel and 6.4.0 kernel. >>>>>>> >>>>>>> Which is is clearly wrong. It should be at least 360416 bytes (i.e >>>>>>> tr_itrunc), because that's what the EFI being processed that pins >>>>>>> the tail of the log is supposed to have reserved when it was >>>>>>> stalled. >>>>>> >>>>>> Yep, exactly. >>>>>> >>>>>>> So where has the ~180kB of leaked space come from? >>>>>>> >>>>>>> Have you traced the grant head reservations to find out >>>>>>> what the runtime log space and grant head reservations actually are? >>>>>> I have the numbers in vmcore (ignore the WARNs), >>>>> >>>>> That's not what I'm asking. You've dumped the values at the time of >>>>> the hang, not traced the runtime reservations that have been made. >>>>> >>>>>>> i.e. we have full tracing of the log reservation accounting via >>>>>>> tracepoints in the kernel. If there is a leak occurring, you need to >>>>>>> capture a trace of all the reservation accounting operations and >>>>>>> post process the output to find out what operation is leaking >>>>>>> reserved space. e.g. 
>>>>>>> >>>>>>> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo >>>>>>> .... >>>>>>> # trace-cmd report > s.t >>>>>>> # head -3 s.t >>>>>>> cpus=16 >>>>>>> touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>>>>>> touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 >>>>>>> >>>>>>> # >>>>>>> >>>>>>> So this tells us the transaction reservation unit size, the count of >>>>>>> reservations, the current reserve and grant head locations, and the >>>>>>> current head and tail of the log at the time the transaction >>>>>>> reservation is started and then after it completes. >>>>>> >>>>>> Will do that and report back. You want full log or only some typical >>>>>> ones? Full log would be big, how shall I share? >>>>> >>>>> I don't want to see the log. It'll be huge - I regularly generate >>>>> traces containing gigabytes of log accounting traces like this from >>>>> a single workload. >>>>> >>>>> What I'm asking you to do is run the tracing and then post process >>>>> the values from the trace to determine what operation is using more >>>>> space than is being freed back to the log. >>>>> >>>>> I generally do this with grep, awk and sed. some people use python >>>>> or perl. But either way it's a *lot* of work - in the past I have >>>>> spent _weeks_ on trace analysis to find a 4 byte leak in the log >>>>> space accounting. 
Doing things like graphing the head, tail and grant >>>>> spaces over time tends to show if this is a gradual leak versus a >>>>> sudden step change. If it's a sudden step change, then you can >>>>> isolate it in the trace and work out what happened. If it's a >>>>> gradual change, then you need to start looking for accounting >>>>> discrepancies... >>>>> >>>>> e.g. a transaction records 32 bytes used in the item, so it releases >>>>> t_unit - 32 bytes at commit. However, the CIL may then only track 28 >>>>> bytes of space for the item in the journal and we leak 4 bytes of >>>>> reservation on every one of those items committed. >>>>> >>>>> These sorts of leaks typically only add up to being something >>>>> significant in situations where the log is flooded with tiny inode >>>>> timestamp changes - 4 bytes per item doesn't really matter when you >>>>> only have a few thousand items in the log, but when you have >>>>> hundreds of thousands of tiny items in the log... >>>>> >>>> >>>> OK. will work more on this. >>>> # I am going to start a two-week vacation, and will then continue on this when back. >>>> >>>> thanks, >>>> wengang
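To make the reservation arithmetic in this message concrete, here is a minimal userspace sketch. It is an illustrative model, not kernel code; the only inputs are the numbers quoted in the thread (a 180208-byte reservation unit after xlog_calc_unit_res(), and 181760 bytes of free on-disk log space at mount time). It shows why reserving a two-count tr_itruncate hangs recovery while a one-count reservation would fit.

```c
#include <assert.h>
#include <stdbool.h>

/* Numbers reported earlier in the thread. */
#define LOG_UNIT_RES	180208u	/* tr_itruncate unit after xlog_calc_unit_res() */
#define LOG_FREE_BYTES	181760u	/* free on-disk log space at mount time */

/* Grant space asked for up front: one unit per log count. */
static unsigned int grant_needed(unsigned int unit_res, int logcount)
{
	return unit_res * (unsigned int)logcount;
}

/* The reservation can only be granted if it fits in the free space. */
static bool reservation_fits(unsigned int unit_res, int logcount)
{
	return grant_needed(unit_res, logcount) <= LOG_FREE_BYTES;
}
```

With logcount = 2 the request is 360416 bytes, which exceeds the 181760 bytes available, so the mount blocks in xlog_grant_head_wait() forever; with logcount = 1 the 180208-byte request fits and recovery can proceed.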
* Re: Question: reserve log space at IO time for recover 2023-08-18 3:25 ` Wengang Wang 2023-08-21 22:06 ` Wengang Wang @ 2023-08-24 4:52 ` Darrick J. Wong 2023-08-24 7:28 ` Dave Chinner 1 sibling, 1 reply; 22+ messages in thread From: Darrick J. Wong @ 2023-08-24 4:52 UTC (permalink / raw) To: Wengang Wang; +Cc: Dave Chinner, linux-xfs@vger.kernel.org, Srikanth C S On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: > What things look to be: > > For the file deletion, log bytes are reserved based on xfs_mount->tr_itruncate which is: > { > tr_logres = 175488, > tr_logcount = 2, > tr_logflags = XFS_TRANS_PERM_LOG_RES, > } > You see it’s a permanent log reservation with two log operations (two transactions in rolling mode). > After calculation (xlog_calc_unit_res(), adding space for various log headers), the final > log space needed per transaction changes from 175488 to 180208 bytes. So the total > log space needed is 360416 (180208 * 2). > The above log space (360416 bytes) needs to be reserved for both run time inode removal > (xfs_inactive_truncate()) and EFI recovery (xfs_efi_item_recover()). > > RUNTIME > ======== > > For run time inode removal, the first transaction is mainly used for inode field changes. > The second transaction is used for intents including extent freeing. > > For the first transaction, it has 180208 reserved log bytes (another 180208 bytes reserved > for the coming transaction). > The first transaction is committed, writing some bytes to the log and releasing the remaining reserved bytes. > > Now we have the second transaction which has 180208 log bytes reserved too. The second > transaction is supposed to process intents including extent freeing. With my hacking patch, > I blocked the extent freeing for 5 hours. So in those 5 hours, 180208 (NOT 360416) log bytes are reserved. > > With my test case, other transactions (update timestamps) then happen.
As my hacking patch > pins the journal tail, those timestamp-updating transactions finally use up (almost) all the remaining available > log space (in memory and on disk). And finally the on disk (and in memory) available log space > goes down near to 180208 bytes. Those 180208 bytes are reserved by the above second (extent-free) > transaction. > > Panic the kernel and remount the xfs volume > > LOG RECOVER > ============= > > With log recovery, during EFI recovery, we use tr_itruncate again to reserve two transactions that need > 360416 log bytes. Reserving 360416 bytes fails (blocks) because we now only have about 180208 available. > > THINKING > ======== > Actually during the EFI recovery, we only need one transaction to free the extents, just like the 2nd > transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have > (a bit) more than 180208 available log bytes on disk, so the reservation goes through and the recovery proceeds. > That is to say: we can fix the log recovery part to fix the issue. We can introduce a new xfs_trans_res > xfs_mount->tr_ext_free > { > tr_logres = 175488, > tr_logcount = 0, > tr_logflags = 0, > } > and use tr_ext_free instead of tr_itruncate in EFI recovery. (didn’t try it). Hmm. The log intent item recovery functions (e.g. xfs_efi_item_recover) take the recovered log item and perform *exactly* one step of the transaction chain. If there's more work to do, the xfs_defer_ops_capture_and_commit machinery captures the defer ops, block reservation, and transaction reservation for later. Later, in xlog_finish_defer_ops, we "rehydrate" the old transaction and block reservation, attach the defer ops, and finish the chain: /* * Create a new transaction reservation from the * captured information. Set logcount to 1 to force the * new transaction to regrant every roll so that we can * make forward progress in recovery no matter how full * the log might be.
*/ resv.tr_logres = dfc->dfc_logres; resv.tr_logcount = 1; resv.tr_logflags = XFS_TRANS_PERM_LOG_RES; error = xfs_trans_alloc(mp, &resv, dfc->dfc_blkres, dfc->dfc_rtxres, XFS_TRANS_RESERVE, &tp); if (error) { xlog_force_shutdown(mp->m_log, SHUTDOWN_LOG_IO_ERROR); return error; } /* * Transfer to this new transaction all the dfops we * captured from recovering a single intent item. */ list_del_init(&dfc->dfc_list); xfs_defer_ops_continue(dfc, tp, &dres); error = xfs_trans_commit(tp); Since xfs_efi_item_recover is only performing one step of what could be a chain of deferred updates, it never rolls the transaction that it creates. It therefore only requires the amount of grant space that you'd get with tr_logcount == 1. It is therefore a bit silly that we ask for more than that, and in bad cases like this, hang log recovery needlessly. Which is exactly what you theorized above. Ok, I'm starting to be convinced... :) I wonder, if you add this to the variable declarations in xfs_efi_item_recover (or xfs_efi_recover if we're actually talking about UEK5 here): struct xfs_trans_resv resv = M_RES(mp)->tr_itruncate; and then change the xfs_trans_alloc call to: resv.tr_logcount = 1; error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); Does that solve the problem? --D > thanks, > wengang > > > On Jul 28, 2023, at 10:56 AM, Wengang Wang <wen.gang.wang@oracle.com> wrote: > > > > > > > >> On Jul 25, 2023, at 9:08 PM, Dave Chinner <david@fromorbit.com> wrote: > >> > >> On Mon, Jul 24, 2023 at 06:03:02PM +0000, Wengang Wang wrote: > >>>> On Jul 23, 2023, at 5:57 PM, Dave Chinner <david@fromorbit.com> wrote: > >>>> On Fri, Jul 21, 2023 at 07:36:03PM +0000, Wengang Wang wrote: > >>>>> FYI: > >>>>> > >>>>> I am able reproduce the XFS mount hang issue with hacked kernels based on > >>>>> both 4.14.35 kernel or 6.4.0 kernel. > >>>>> Reproduce steps: > >>>>> > >>>>> 1. create a XFS with 10MiB log size (small so easier to reproduce). The following > >>>>> steps all aim at this XFS volume. 
> >>>> > >>>> Actually, make that a few milliseconds.... :) > >>> > >>> :) > >>> > >>>> mkfs/xfs_info output would be appreciated. > >>> > >>> sure, > >>> # xfs_info 20GB.bk2 > >>> meta-data=20GB.bk2 isize=256 agcount=4, agsize=1310720 blks > >>> = sectsz=512 attr=2, projid32bit=1 > >>> = crc=0 finobt=0, sparse=0, rmapbt=0 > >>> = reflink=0 > >> > >> Hmmmm. Why are you only testing v4 filesystems? They are deprecated > >> and support is largely due to be dropped from upstream in 2025... > >> > > > > Ha, it just happened to be so. > > I was trying to reproduce it in the same environment as customer has — > > that’s OracleLinux7. The default behavior of mkfs.xfs in OL7 is to format > > v4 filesystems. I created the xfs image in a file on OL7 and copied the image > > file to a 6.4.0 kernel machine. That’s why you see v4 filesystem here. > > > >> Does the same problem occur with a v5 filesystems? > > > > Will test and report back. > > > >> > >>>>> 5. Checking the on disk left free log space, it’s 181760 bytes for both 4.14.35 > >>>>> kernel and 6.4.0 kernel. > >>>> > >>>> Which is is clearly wrong. It should be at least 360416 bytes (i.e > >>>> tr_itrunc), because that's what the EFI being processed that pins > >>>> the tail of the log is supposed to have reserved when it was > >>>> stalled. > >>> > >>> Yep, exactly. > >>> > >>>> So where has the ~180kB of leaked space come from? > >>>> > >>>> Have you traced the grant head reservations to find out > >>>> what the runtime log space and grant head reservations actually are? > >>> I have the numbers in vmcore (ignore the WARNs), > >> > >> That's not what I'm asking. You've dumped the values at the time of > >> the hang, not traced the runtime reservations that have been made. > >> > >>>> i.e. we have full tracing of the log reservation accounting via > >>>> tracepoints in the kernel. 
If there is a leak occurring, you need to > >>>> capture a trace of all the reservation accounting operations and > >>>> post process the output to find out what operation is leaking > >>>> reserved space. e.g. > >>>> > >>>> # trace-cmd record -e xfs_log\* -e xlog\* -e printk touch /mnt/scratch/foo > >>>> .... > >>>> # trace-cmd report > s.t > >>>> # head -3 s.t > >>>> cpus=16 > >>>> touch-289000 [008] 430907.633820: xfs_log_reserve: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 1024 grant_write_cycle 1 grant_write_bytes 1024 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > >>>> touch-289000 [008] 430907.633829: xfs_log_reserve_exit: dev 253:32 t_ocnt 2 t_cnt 2 t_curr_res 240888 t_unit_res 240888 t_flags XLOG_TIC_PERM_RESERV reserveq empty writeq empty grant_reserve_cycle 1 grant_reserve_bytes 482800 grant_write_cycle 1 grant_write_bytes 482800 curr_cycle 1 curr_block 2 tail_cycle 1 tail_block 2 > >>>> > >>>> # > >>>> > >>>> So this tells us the transaction reservation unit size, the count of > >>>> reservations, the current reserve and grant head locations, and the > >>>> current head and tail of the log at the time the transaction > >>>> reservation is started and then after it completes. > >>> > >>> Will do that and report back. You want full log or only some typical > >>> ones? Full log would be big, how shall I share? > >> > >> I don't want to see the log. It'll be huge - I regularly generate > >> traces containing gigabytes of log accounting traces like this from > >> a single workload. > >> > >> What I'm asking you to do is run the tracing and then post process > >> the values from the trace to determine what operation is using more > >> space than is being freed back to the log. > >> > >> I generally do this with grep, awk and sed. some people use python > >> or perl. 
But either way it's a *lot* of work - in the past I have > >> spent _weeks_ on trace analysis to find a 4 byte leak in the log > >> space accounting. Doing things like graphing the head, tail and grant > >> spaces over time tends to show if this is a gradual leak versus a > >> sudden step change. If it's a sudden step change, then you can > >> isolate it in the trace and work out what happened. If it's a > >> gradual change, then you need to start looking for accounting > >> discrepancies... > >> > >> e.g. a transaction records 32 bytes used in the item, so it releases > >> t_unit - 32 bytes at commit. However, the CIL may then only track 28 > >> bytes of space for the item in the journal and we leak 4 bytes of > >> reservation on every one of those items committed. > >> > >> These sorts of leaks typically only add up to being something > >> significant in situations where the log is flooded with tiny inode > >> timestamp changes - 4 bytes per item doesn't really matter when you > >> only have a few thousand items in the log, but when you have > >> hundreds of thousands of tiny items in the log... > >> > > > > OK. will work more on this. > > # I am going to start a two-week vacation, and will then continue on this when back. > > > > thanks, > > wengang
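Dave's 4-byte leak example quoted above can be sketched as a toy model. This is illustrative only, not the real CIL accounting code: a committed transaction gives back t_unit minus the bytes it recorded for the item, but if the CIL tracks fewer bytes for that item than the commit accounted for, the difference is leaked reservation that compounds per committed item.

```c
#include <assert.h>

/*
 * Toy model of the accounting discrepancy Dave describes: the commit
 * path accounts used_bytes for an item, the CIL tracks only cil_bytes,
 * so each committed item leaks (used_bytes - cil_bytes) of reservation.
 */
static unsigned long leaked_bytes(unsigned long used_bytes,
				  unsigned long cil_bytes,
				  unsigned long nr_items)
{
	return (used_bytes - cil_bytes) * nr_items;
}
```

At 4 bytes per item, a few thousand items leak only a few KiB, but hundreds of thousands of tiny timestamp-update items (say 300,000) would leak over a megabyte, which is substantial against the 10 MiB log used in the reproducer.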
* Re: Question: reserve log space at IO time for recover 2023-08-24 4:52 ` Darrick J. Wong @ 2023-08-24 7:28 ` Dave Chinner 2023-08-24 22:01 ` Darrick J. Wong 2023-08-24 23:53 ` Wengang Wang 0 siblings, 2 replies; 22+ messages in thread From: Dave Chinner @ 2023-08-24 7:28 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote: > On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: > > Since xfs_efi_item_recover is only performing one step of what could be > a chain of deferred updates, it never rolls the transaction that it > creates. It therefore only requires the amount of grant space that > you'd get with tr_logcount == 1. It is therefore a bit silly that we > ask for more than that, and in bad cases like this, hang log recovery > needlessly. But this doesn't fix whatever problem led to the recovery not having the same full tr_itruncate reservation available as was held by the transaction that logged the EFI and was running the extent free at the time the system crashed. There should -always- be enough transaction reservation space in the journal to reserve space for an intent replay if the intent recovery reservation uses the same reservation type as runtime. Hence I think this is still just a band-aid over whatever went wrong at runtime that led to the log not having enough space for a reservation that was clearly held at runtime and hadn't yet been used. > Which is exactly what you theorized above. Ok, I'm starting to be convinced... :) I'm not. :/ > I wonder, if you add this to the variable declarations in > xfs_efi_item_recover (or xfs_efi_recover if we're actually talking about > UEK5 here): > > struct xfs_trans_res resv = M_RES(mp)->tr_itruncate; > > and then change the xfs_trans_alloc call to: > > resv.tr_logcount = 1; > error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); > > Does that solve the problem?
It might fix this specific case given the state of the log at the time the system crashed. However, it doesn't help the general case where whatever went wrong at runtime results in there being less than a single tr_itruncate reservation unit available in the log. One of the recent RH customer cases I looked at had much less than a single tr_itruncate unit reservation when it came to recovering the EFI, so I'm not just making this up. I really don't think this problem is solvable at recovery time. If the runtime hang/crash leaves the EFI pinning the tail of the log and something has allowed the log to be overfull (i.e. log space used plus granted EFI reservation > physical log space), then we may not have any space at all for any sort of reservation during recovery. I suspect that the cause of this recovery issue is that we require an overcommit of the log space accounting before we throttle incoming transaction reservations (i.e. the new reservation has to overrun before we throttle). I think that the reservation accounting overrun detection can race to the first item being placed on the wait list, which might allow transactions that should be throttled to escape, resulting in a temporary log space overcommit. If we crash at just the wrong time and an intent-in-progress pins the tail of the log, the overcommit situation can lead to recovery not having enough space for intent recovery reservations... This subtle overcommit is one of the issues I have been trying to correct with the byte-based grant head accounting patches (which I'm -finally- getting back to). I know that replaying the log from the metadump that reproduces the hang with the byte-based grant head accounting patchset allowed log recovery to succeed. It has a different concept of where the head and tail are and hence how much log space is actually available at any given time, and that difference was just enough to allow a tr_itruncate reservation to succeed.
It also has different reservation grant overrun detection logic, but I'm not 100% sure if it solves this underlying runtime overcommit problem or not yet... Cheers, Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-08-24 7:28 ` Dave Chinner @ 2023-08-24 22:01 ` Darrick J. Wong 2023-08-26 3:37 ` Dave Chinner 2023-08-24 23:53 ` Wengang Wang 1 sibling, 1 reply; 22+ messages in thread From: Darrick J. Wong @ 2023-08-24 22:01 UTC (permalink / raw) To: Dave Chinner; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Thu, Aug 24, 2023 at 05:28:23PM +1000, Dave Chinner wrote: > On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote: > > On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: > > > > Since xfs_efi_item_recover is only performing one step of what could be > > a chain of deferred updates, it never rolls the transaction that it > > creates. It therefore only requires the amount of grant space that > > you'd get with tr_logcount == 1. It is therefore a bit silly that we > > ask for more than that, and in bad cases like this, hang log recovery > > needlessly. > > But this doesn't fix the whatever problem lead to the recovery not > having the same full tr_itruncate reservation available as was held > by the transaction that logged the EFI and was running the extent > free at the time the system crashed. There should -always- be enough > transaction reservation space in the journal to reserve space for an > intent replay if the intent recovery reservation uses the same > reservation type as runtime. > > Hence I think this is still just a band-aid over whatever went wrong > at runtime that lead to the log not having enough space for a > reservation that was clearly held at runtime and hadn't yet used. Maybe I'm not remembering accurately how permanent log reservations work. Let's continue picking on tr_itruncate from Wengang's example. 
IIRC, he said that tr_itruncate on the running system was defined roughly like so:

tr_itruncate = {
	.tr_logres = 180K,
	.tr_logcount = 2,
	.tr_logflags = XFS_TRANS_PERM_LOG_RES,
}

At runtime, when we want to start a truncation update, we do this:

	xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, ...);

Call sequence: xfs_trans_alloc -> xfs_trans_reserve -> xfs_log_reserve -> xlog_ticket_alloc. Ticket allocation assigns tic->t_unit_res = (tr_logres + overhead) and tic->t_cnt = tr_logcount. (Let's pretend for the sake of argument that the overhead is 5K.) Now xfs_log_reserve calls xlog_grant_head_check. That does:

	*need_bytes = xlog_ticket_reservation(log, head, tic);

For the reserve head, the ticket reservation computation is (tic->t_unit_res * tic->t_cnt), which in this case is (185K * 2) == 370K, right? So we make sure there's at least 370K free in the reserve head, then add that to the reserve and write heads. Now that we've allocated the transaction, delete the bmap mapping, log an EFI to free the space, and roll the transaction as part of finishing the deferops chain. Rolling creates a new xfs_trans which shares its ticket with the old transaction. Next, xfs_trans_roll calls __xfs_trans_commit with regrant == true, which calls xlog_cil_commit with the same regrant parameter. xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and subtracts t_curr_res from the reservation and write heads. If the filesystem is fresh and the first transaction only used (say) 20K, then t_curr_res will be 165K, and we give that much reservation back to the reservation head. Or if the file is really fragmented and the first transaction actually uses 170K, then t_curr_res will be 15K, and that's what we give back to the reservation. Having done that, we're now headed into the second transaction with an EFI and 185K of reservation, correct? Now let's say the first transaction gets written to disk and we crash without ever completing the second transaction. 
Now we remount the fs, log recovery finds the unfinished EFI, and calls xfs_efi_recover to finish the EFI. However, xfs_efi_recover starts a new tr_itruncate transaction, which asks for (185K * 2) == 370K of log reservation. This is a lot more than the 185K that we had reserved at the time of the crash. Did I get all that right? If so, what if prior to the crash, the system had many other threads initiating much smaller transactions, say for mtime updates? And what if those threads consumed all the rest of the log reservation and pushed those transactions out to disk after the first transaction in the tr_itruncate chain? Won't log recovery then find the log to contain (in order):

1. The first itruncate transaction and the EFI.
2. A bunch of mtime update transactions.
3. Approximately 185K of free space.

We can't move the log tail forward because we haven't dealt with the EFI; there's only 185K of log space left to reserve; and xfs_efi_recover wants 370K. It's only going to take one step in the intent processing chain and commit that step, so I don't see why the recovery transaction needs more than one step's worth of log space. Any subsequent steps in the deferops chain get a new transaction later once we've unpinned the log tail. > > Which is exactly what you theorized above. Ok, I'm starting to be > > convinced... :) > > I'm not. :/ > > > I wonder, if you add this to the variable declarations in > > xfs_efi_item_recover (or xfs_efi_recover if we're actually talking about > > UEK5 here): > > > > struct xfs_trans_resv resv = M_RES(mp)->tr_itruncate; > > > > and then change the xfs_trans_alloc call to: > > > > resv.tr_logcount = 1; > > error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); > > > > Does that solve the problem? > > It might fix this specific case given the state of the log at the > time the system crashed. Yes. There are running kernels out there that persist this situation (bugs and all) in the ondisk log. 
I know so, because the issue that Wengang has been dealing with came from a customer. IIRC, the customer report stated that they were truncating some heavily fragmented sparse files, the node stalled long enough that the node manager rebooted the node, and when the node came back up, log recovery stalled because that's what the running kernel had written to the log. IOWs, I'm operating on an assumption that we have two problems to solve: the runtime accounting bug that you've been chasing, and Wengang's, where log recovery asks for more log space than what had been in the log ticket at the time of the crash. > However, it doesn't help the general > case where whatever went wrong at runtime results in there being > less than a single tr_itruncate reservation unit available in the > log. I agree that this fix won't address the problem of how runtime got here in the first place. > One of the recent RH custom cases I looked at had much less than a > single tr_itruncate unit reservation when it came to recovering the > EFI, so I'm not just making this up. I didn't think you were, I merely am not ready to decide that these aren't separate problems. :) > I really don't think this problem is not solvable at recovery time. > if the runtime hang/crash leaves the EFI pinning the tail of the log > and something has allowed the log to be overfull (i.e. log space > used plus granted EFI reservation > physical log space), then we may > not have any space at all for any sort of reservation during > recovery. > > I suspect that the cause of this recovery issue is the we require an > overcommit of the log space accounting before we throttle incoming > transaction reservations (i.e. the new reservation has to overrun > before we throttle). 
I think that the reservation accounting overrun > detection can race to the first item being placed on the wait list, Yeah, I was wondering that myself when I was looking at the logic between list_empty_careful and the second xlog_grant_head_wait and wondering if that "careful" construction actually worked. > which might allow transactions that should be throttled to escape > resulting in a temporary log space overcommit. IF we crash at jsut > the wrong time and an intent-in-progress pins the tail of the log, > the overcommit situation can lead to recovery not having enough > space for intent recovery reservations... > > This subtle overcommit is one of the issues I have been trying to > correct with the byte-based grant head accounting patches (which I'm > -finally- getting back to). I know that replaying the log from the > metadump that repoduces the hang with the byte-based grant head > accounting patchset allowed log recovery to succeed. It has a > different concept of where the head and tail are and hence how much > log space is actually available at any given time, and that > difference was just enough to allow a tr_itruncate reservation to > succced. It also has different reservation grant overrun detection > logic, but I'm not 100% sure if it solves this underlying runtime > overcommit problem or not yet... <nod> --D > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-08-24 22:01 ` Darrick J. Wong @ 2023-08-26 3:37 ` Dave Chinner 2023-08-27 16:04 ` Darrick J. Wong 0 siblings, 1 reply; 22+ messages in thread From: Dave Chinner @ 2023-08-26 3:37 UTC (permalink / raw) To: Darrick J. Wong; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Thu, Aug 24, 2023 at 03:01:54PM -0700, Darrick J. Wong wrote: > On Thu, Aug 24, 2023 at 05:28:23PM +1000, Dave Chinner wrote: > > On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote: > > > On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: > > > > > > Since xfs_efi_item_recover is only performing one step of what could be > > > a chain of deferred updates, it never rolls the transaction that it > > > creates. It therefore only requires the amount of grant space that > > > you'd get with tr_logcount == 1. It is therefore a bit silly that we > > > ask for more than that, and in bad cases like this, hang log recovery > > > needlessly. > > > > But this doesn't fix the whatever problem lead to the recovery not > > having the same full tr_itruncate reservation available as was held > > by the transaction that logged the EFI and was running the extent > > free at the time the system crashed. There should -always- be enough > > transaction reservation space in the journal to reserve space for an > > intent replay if the intent recovery reservation uses the same > > reservation type as runtime. > > > > Hence I think this is still just a band-aid over whatever went wrong > > at runtime that lead to the log not having enough space for a > > reservation that was clearly held at runtime and hadn't yet used. > > Maybe I'm not remembering accurately how permanent log reservations > work. Let's continue picking on tr_itruncate from Wengang's example. 
> IIRC, he said that tr_itruncate on the running system was defined > roughly like so: > > tr_itruncate = { > .tr_logres = 180K > .tr_logcount = 2, > .tr_logflags = XFS_TRANS_PERM_LOG_RES, > } > > At runtime, when we want to start a truncation update, we do this: > > xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, ...); > > Call sequence: xfs_trans_alloc -> xfs_trans_reserve -> xfs_log_reserve > -> xlog_ticket_alloc. Ticket allocation assigns tic->t_unit_res = > (tr_logres + overhead) and tic->t_cnt = tr_logcount. > > (Let's pretend for the sake of argument that the overhead is 5K.) > > Now xfs_log_reserve calls xlog_grant_head_check. That does: > > *need_bytes = xlog_ticket_reservation(log, head, tic); > > For the reserve head, the ticket reservation computation is > (tic->t_unit_res * tic->t_cnt), which in this case is (185K * 2) == > 370K, right? So we make sure there's at least 370K free in the reserve > head, then add that to the reserve and write heads. > > Now that we've allocated the transaction, delete the bmap mapping, > log an EFI to free the space, and roll the transaction as part of > finishing the deferops chain. Rolling creates a new xfs_trans which > shares its ticket with the old transaction. Next, xfs_trans_roll calls > __xfs_trans_commit with regrant == true, which calls xlog_cil_commit > with the same regrant parameter. > > xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and > subtracts t_curr_res from the reservation and write heads. > > If the filesystem is fresh and the first transaction only used (say) > 20K, then t_curr_res will be 165K, and we give that much reservation > back to the reservation head. Or if the file is really fragmented and > the first transaction actually uses 170K, then t_curr_res will be 15K, > and that's what we give back to the reservation. > > Having done that, we're now headed into the second transaction with an > EFI and 185K of reservation, correct? 
Ah, right, I overlooked that the long running truncates only regrant a single reservation unit at a time when they roll, and the runtime instances I've seen of this are with long running truncate operations (i.e. inactivation after unlink for multi-thousand extent inodes). So, yes, all types of intent recovery must only use a single unit reservation, not just EFIs, because there is no guarantee that there is a full unit * count reservation available in the journal whenever that intent was first logged. Indeed, at best it will be 'count - 1' that is available, because the transaction that logged the intent will have used a full unit of the original reservation.... > IOWs, I'm operating on an assumption that we have two problems to solve: > the runtime acconting bug that you've been chasing, and Wengang's where > log recovery asks for more log space than what had been in the log > ticket at the time of the crash. OK, yes, it does seem there are two problems here... > > I suspect that the cause of this recovery issue is the we require an > > overcommit of the log space accounting before we throttle incoming > > transaction reservations (i.e. the new reservation has to overrun > > before we throttle). I think that the reservation accounting overrun > > detection can race to the first item being placed on the wait list, > > Yeah, I was wondering that myself when I was looking at the logic > between list_empty_careful and the second xlog_grant_head_wait and > wondering if that "careful" construction actually worked. Well, it's not the careful check that is the problem; the race is much simpler than that. We just have to have two concurrent reservations that will both fit in the remaining log space individually but not together. i.e. 
this is the overcommit race window:

P1                              P2
---------------------------     --------------------------
xfs_log_reserve(10000 bytes)
xlog_grant_head_check()
xlog_space_left()
  <sees 12000 bytes>
OK, doesn't wait
                                xfs_log_reserve(10000 bytes)
                                xlog_grant_head_check()
                                xlog_space_left()
                                  <sees 12000 bytes>
                                OK, doesn't wait
xlog_grant_add_space(10k)
                                xlog_grant_add_space(10k)

And now we've overcommitted the log by 8000 bytes.... The current byte based grant head patch set does not address this; the attempt I made to fix it didn't work properly, so I split the rework of the throttling out of the byte based accounting patchset (which seems to work correctly without reworking xlog_grant_head_check()) and now I'm trying to address this race condition as a separate patchset... -Dave. -- Dave Chinner david@fromorbit.com ^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-08-26 3:37 ` Dave Chinner @ 2023-08-27 16:04 ` Darrick J. Wong 0 siblings, 0 replies; 22+ messages in thread From: Darrick J. Wong @ 2023-08-27 16:04 UTC (permalink / raw) To: Dave Chinner; +Cc: Wengang Wang, linux-xfs@vger.kernel.org, Srikanth C S On Sat, Aug 26, 2023 at 01:37:24PM +1000, Dave Chinner wrote: > On Thu, Aug 24, 2023 at 03:01:54PM -0700, Darrick J. Wong wrote: > > On Thu, Aug 24, 2023 at 05:28:23PM +1000, Dave Chinner wrote: > > > On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote: > > > > On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: > > > > > > > > Since xfs_efi_item_recover is only performing one step of what could be > > > > a chain of deferred updates, it never rolls the transaction that it > > > > creates. It therefore only requires the amount of grant space that > > > > you'd get with tr_logcount == 1. It is therefore a bit silly that we > > > > ask for more than that, and in bad cases like this, hang log recovery > > > > needlessly. > > > > > > But this doesn't fix the whatever problem lead to the recovery not > > > having the same full tr_itruncate reservation available as was held > > > by the transaction that logged the EFI and was running the extent > > > free at the time the system crashed. There should -always- be enough > > > transaction reservation space in the journal to reserve space for an > > > intent replay if the intent recovery reservation uses the same > > > reservation type as runtime. > > > > > > Hence I think this is still just a band-aid over whatever went wrong > > > at runtime that lead to the log not having enough space for a > > > reservation that was clearly held at runtime and hadn't yet used. > > > > Maybe I'm not remembering accurately how permanent log reservations > > work. Let's continue picking on tr_itruncate from Wengang's example. 
> > IIRC, he said that tr_itruncate on the running system was defined > > roughly like so: > > > > tr_itruncate = { > > .tr_logres = 180K > > .tr_logcount = 2, > > .tr_logflags = XFS_TRANS_PERM_LOG_RES, > > } > > > > At runtime, when we want to start a truncation update, we do this: > > > > xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, ...); > > > > Call sequence: xfs_trans_alloc -> xfs_trans_reserve -> xfs_log_reserve > > -> xlog_ticket_alloc. Ticket allocation assigns tic->t_unit_res = > > (tr_logres + overhead) and tic->t_cnt = tr_logcount. > > > > (Let's pretend for the sake of argument that the overhead is 5K.) > > > > Now xfs_log_reserve calls xlog_grant_head_check. That does: > > > > *need_bytes = xlog_ticket_reservation(log, head, tic); > > > > For the reserve head, the ticket reservation computation is > > (tic->t_unit_res * tic->t_cnt), which in this case is (185K * 2) == > > 370K, right? So we make sure there's at least 370K free in the reserve > > head, then add that to the reserve and write heads. > > > > Now that we've allocated the transaction, delete the bmap mapping, > > log an EFI to free the space, and roll the transaction as part of > > finishing the deferops chain. Rolling creates a new xfs_trans which > > shares its ticket with the old transaction. Next, xfs_trans_roll calls > > __xfs_trans_commit with regrant == true, which calls xlog_cil_commit > > with the same regrant parameter. > > > > xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and > > subtracts t_curr_res from the reservation and write heads. > > > > If the filesystem is fresh and the first transaction only used (say) > > 20K, then t_curr_res will be 165K, and we give that much reservation > > back to the reservation head. Or if the file is really fragmented and > > the first transaction actually uses 170K, then t_curr_res will be 15K, > > and that's what we give back to the reservation. 
> > > > Having done that, we're now headed into the second transaction with an > > EFI and 185K of reservation, correct? > > Ah, right, I overlooked that the long running truncates only regrant a > single reservation unit at a time when they roll, and the runtime > instances I seen of this are with long running truncate operations > (i.e. inactivation after unlink for multi-thousand extent inodes). > > So, yes, all types of intent recovery must only use a single unit > reservation, not just EFIs, because there is no guarantee that there > is a full unit * count reservation available in the journal whenever > that intent was first logged. Indeed, at best it will be 'count - 1' > that is avaialble, becuase the transaction that logged the intent > will have used a full unit of the original reservation.... <nod> I think something like this will fix the problem, right?

diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 2420865f3007..a5100a11faf9 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -131,4 +131,26 @@ void xlog_check_buf_cancel_table(struct xlog *log);
 #define xlog_check_buf_cancel_table(log) do { } while (0)
 #endif
 
+/*
+ * Transform a regular reservation into one suitable for recovery of a log
+ * intent item.
+ *
+ * Intent recovery only runs a single step of the transaction chain and defers
+ * the rest to a separate transaction. Therefore, we reduce logcount to 1 here
+ * to avoid livelocks if the log grant space is nearly exhausted due to the
+ * recovered intent pinning the tail. Keep the same logflags to avoid tripping
+ * asserts elsewhere. Struct copies abound below.
+ */
+static inline struct xfs_trans_res
+xlog_recover_resv(const struct xfs_trans_res *r)
+{
+	struct xfs_trans_res ret = {
+		.tr_logres	= r->tr_logres,
+		.tr_logcount	= 1,
+		.tr_logflags	= r->tr_logflags,
+	};
+
+	return ret;
+}
+
 #endif /* __XFS_LOG_RECOVER_H__ */
diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c
index 2788a6f2edcd..36fe2abb16e6 100644
--- a/fs/xfs/xfs_attr_item.c
+++ b/fs/xfs/xfs_attr_item.c
@@ -547,7 +547,7 @@ xfs_attri_item_recover(
 	struct xfs_inode		*ip;
 	struct xfs_da_args		*args;
 	struct xfs_trans		*tp;
-	struct xfs_trans_res		tres;
+	struct xfs_trans_res		resv;
 	struct xfs_attri_log_format	*attrp;
 	struct xfs_attri_log_nameval	*nv = attrip->attri_nameval;
 	int				error;
@@ -618,8 +618,9 @@ xfs_attri_item_recover(
 		goto out;
 	}
 
-	xfs_init_attr_trans(args, &tres, &total);
-	error = xfs_trans_alloc(mp, &tres, total, 0, XFS_TRANS_RESERVE, &tp);
+	xfs_init_attr_trans(args, &resv, &total);
+	resv = xlog_recover_resv(&resv);
+	error = xfs_trans_alloc(mp, &resv, total, 0, XFS_TRANS_RESERVE, &tp);
 	if (error)
 		goto out;
 
diff --git a/fs/xfs/xfs_bmap_item.c b/fs/xfs/xfs_bmap_item.c
index 7551c3ec4ea5..e736a0844c89 100644
--- a/fs/xfs/xfs_bmap_item.c
+++ b/fs/xfs/xfs_bmap_item.c
@@ -490,6 +490,7 @@ xfs_bui_item_recover(
 	struct list_head		*capture_list)
 {
 	struct xfs_bmap_intent		fake = { };
+	struct xfs_trans_res		resv;
 	struct xfs_bui_log_item		*buip = BUI_ITEM(lip);
 	struct xfs_trans		*tp;
 	struct xfs_inode		*ip = NULL;
@@ -515,7 +516,8 @@ xfs_bui_item_recover(
 		return error;
 
 	/* Allocate transaction and do the work. */
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
+	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+	error = xfs_trans_alloc(mp, &resv,
 			XFS_EXTENTADD_SPACE_RES(mp, XFS_DATA_FORK), 0, 0, &tp);
 	if (error)
 		goto err_rele;
diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index f1a5ecf099aa..3fa8789820ad 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -660,6 +660,7 @@ xfs_efi_item_recover(
 	struct xfs_log_item		*lip,
 	struct list_head		*capture_list)
 {
+	struct xfs_trans_res		resv;
 	struct xfs_efi_log_item		*efip = EFI_ITEM(lip);
 	struct xfs_mount		*mp = lip->li_log->l_mp;
 	struct xfs_efd_log_item		*efdp;
@@ -683,7 +684,8 @@ xfs_efi_item_recover(
 		}
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+	error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp);
 	if (error)
 		return error;
 	efdp = xfs_trans_get_efd(tp, efip, efip->efi_format.efi_nextents);
diff --git a/fs/xfs/xfs_refcount_item.c b/fs/xfs/xfs_refcount_item.c
index edd8587658d5..2d4444d61e98 100644
--- a/fs/xfs/xfs_refcount_item.c
+++ b/fs/xfs/xfs_refcount_item.c
@@ -477,6 +477,7 @@ xfs_cui_item_recover(
 	struct xfs_log_item		*lip,
 	struct list_head		*capture_list)
 {
+	struct xfs_trans_res		resv;
 	struct xfs_cui_log_item		*cuip = CUI_ITEM(lip);
 	struct xfs_cud_log_item		*cudp;
 	struct xfs_trans		*tp;
@@ -514,8 +515,9 @@ xfs_cui_item_recover(
 	 * doesn't fit. We need to reserve enough blocks to handle a
 	 * full btree split on either end of the refcount range.
 	 */
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
-			mp->m_refc_maxlevels * 2, 0, XFS_TRANS_RESERVE, &tp);
+	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+	error = xfs_trans_alloc(mp, &resv, mp->m_refc_maxlevels * 2, 0,
+			XFS_TRANS_RESERVE, &tp);
 	if (error)
 		return error;
 
diff --git a/fs/xfs/xfs_rmap_item.c b/fs/xfs/xfs_rmap_item.c
index 520c7ebdfed8..0e0e747028da 100644
--- a/fs/xfs/xfs_rmap_item.c
+++ b/fs/xfs/xfs_rmap_item.c
@@ -507,6 +507,7 @@ xfs_rui_item_recover(
 	struct xfs_log_item		*lip,
 	struct list_head		*capture_list)
 {
+	struct xfs_trans_res		resv;
 	struct xfs_rui_log_item		*ruip = RUI_ITEM(lip);
 	struct xfs_rud_log_item		*rudp;
 	struct xfs_trans		*tp;
@@ -530,8 +531,9 @@ xfs_rui_item_recover(
 		}
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate,
-			mp->m_rmap_maxlevels, 0, XFS_TRANS_RESERVE, &tp);
+	resv = xlog_recover_resv(&M_RES(mp)->tr_itruncate);
+	error = xfs_trans_alloc(mp, &resv, mp->m_rmap_maxlevels, 0,
+			XFS_TRANS_RESERVE, &tp);
 	if (error)
 		return error;
 	rudp = xfs_trans_get_rud(tp, ruip);

> > IOWs, I'm operating on an assumption that we have two problems to solve: > > the runtime acconting bug that you've been chasing, and Wengang's where > > log recovery asks for more log space than what had been in the log > > ticket at the time of the crash. > > OK, yes, it does seem there are two problems here... > > > > I suspect that the cause of this recovery issue is the we require an > > > overcommit of the log space accounting before we throttle incoming > > > transaction reservations (i.e. the new reservation has to overrun > > > before we throttle). I think that the reservation accounting overrun > > > detection can race to the first item being placed on the wait list, > > > > Yeah, I was wondering that myself when I was looking at the logic > > between list_empty_careful and the second xlog_grant_head_wait and > > wondering if that "careful" construction actually worked. 
> > Well, it's not the careful check that is the problem; the race is > much simpler than that. We just have to have two concurrent > reservations that will both fit in the remaining log space > individually but not together. i.e. this is the overcommit race > window:
>
> P1                              P2
> ---------------------------     --------------------------
> xfs_log_reserve(10000 bytes)
> xlog_grant_head_check()
> xlog_space_left()
>   <sees 12000 bytes>
> OK, doesn't wait
>                                 xfs_log_reserve(10000 bytes)
>                                 xlog_grant_head_check()
>                                 xlog_space_left()
>                                   <sees 12000 bytes>
>                                 OK, doesn't wait
> xlog_grant_add_space(10k)
>                                 xlog_grant_add_space(10k)
>
> And now we've overcommitted the log by 8000 bytes.... > > The current byte based grant head patch set does not address this; > the attempt I made to fix it didn't work properly, so I split the > rework of the throttling out of the byte based accounting patchset > (which seems to work correctly without reworking > xlog_grant_head_check()) and now I'm trying to address this race > condition as a separate patchset... Oh. Heh. Yeah. --D > -Dave. > -- > Dave Chinner > david@fromorbit.com ^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: Question: reserve log space at IO time for recover 2023-08-24 7:28 ` Dave Chinner 2023-08-24 22:01 ` Darrick J. Wong @ 2023-08-24 23:53 ` Wengang Wang 1 sibling, 0 replies; 22+ messages in thread From: Wengang Wang @ 2023-08-24 23:53 UTC (permalink / raw) To: Dave Chinner; +Cc: Darrick J. Wong, linux-xfs@vger.kernel.org, Srikanth C S > On Aug 24, 2023, at 12:28 AM, Dave Chinner <david@fromorbit.com> wrote: > > On Wed, Aug 23, 2023 at 09:52:34PM -0700, Darrick J. Wong wrote: >> On Fri, Aug 18, 2023 at 03:25:46AM +0000, Wengang Wang wrote: >> >> Since xfs_efi_item_recover is only performing one step of what could be >> a chain of deferred updates, it never rolls the transaction that it >> creates. It therefore only requires the amount of grant space that >> you'd get with tr_logcount == 1. It is therefore a bit silly that we >> ask for more than that, and in bad cases like this, hang log recovery >> needlessly. > > But this doesn't fix the whatever problem lead to the recovery not > having the same full tr_itruncate reservation available as was held > by the transaction that logged the EFI and was running the extent > free at the time the system crashed. There should -always- be enough > transaction reservation space in the journal to reserve space for an > intent replay if the intent recovery reservation uses the same > reservation type as runtime. > > Hence I think this is still just a band-aid over whatever went wrong > at runtime that lead to the log not having enough space for a > reservation that was clearly held at runtime and hadn't yet used. Hi Dave, I previously suspected a runtime problem as well, but I didn’t find anything wrong. Per my tests and analysis, the changes (additions and subtractions) to xlog->l_reserve_head and xlog->l_write_head match the changes of xlog->l_curr_cycle/l_curr_block, and xlog->l_curr_cycle/l_curr_block matches the on-disk log head changes. Actually there is no problem at runtime. What’s wrong is just the recovery part. 
As I said previously, it’s not true that we always have the full tr_itruncate reservation at every point during the process. Let me show the detailed steps:

1) We take the full tr_itruncate reservation for the two-transaction chain (half each), and get the first transaction.

2) Work with the first transaction, adding log items to it. Now we have the full tr_itruncate reservation.

3) In the transaction roll, we get the second transaction. The second transaction also has half of the tr_itruncate reservation. Now the first transaction and the second transaction together have the full tr_itruncate reservation.

4) Still in the transaction roll, we commit the first transaction, including the EFI log item(s) in it. The reservation belonging to the first transaction goes three different ways: 1. some is used to store the log items (and the necessary headers); 2. a small part is stolen by the ticket on the checkpoint context; 3. the rest is released after the log items are committed to the CIL: xlog_cil_commit() -> xfs_log_ticket_regrant() -> xlog_grant_sub_space(). Now we have only half of the tr_itruncate reservation (for the second trans in the chain), NOT the full one!

5) Work with the second transaction (it’s slow somehow). Now we have half of the tr_itruncate reservation.

6) Parallel transactions come and go, using up the other free log bytes. Now we still have half of the tr_itruncate reservation, and the free space in the on-disk log is only a bit more than half of the tr_itruncate reservation, but less than the full one.

7) Still working with the second transaction in the chain, freeing the extents, the system crashed. We still hold half of the tr_itruncate reservation.

So you see, since step 5) only half of the tr_itruncate reservation exists. The idea that from step 1) to step 7) there should be a full tr_itruncate reservation is wrong. You can’t say that after the first transaction is committed, the reservation for that transaction is still there, right? Thanks for your comments below, but they appear to be based on that mistaken idea. 
Thanks, Wengang > >> Which is exactly what you theorized above. Ok, I'm starting to be >> convinced... :) > > I'm not. :/ > >> I wonder, if you add this to the variable declarations in >> xfs_efi_item_recover (or xfs_efi_recover if we're actually talking about >> UEK5 here): >> >> struct xfs_trans_resv resv = M_RES(mp)->tr_itruncate; >> >> and then change the xfs_trans_alloc call to: >> >> resv.tr_logcount = 1; >> error = xfs_trans_alloc(mp, &resv, 0, 0, 0, &tp); >> >> Does that solve the problem? > > It might fix this specific case given the state of the log at the > time the system crashed. However, it doesn't help the general > case where whatever went wrong at runtime results in there being > less than a single tr_itruncate reservation unit available in the > log. > > One of the recent RH custom cases I looked at had much less than a > single tr_itruncate unit reservation when it came to recovering the > EFI, so I'm not just making this up. > > I really don't think this problem is not solvable at recovery time. > if the runtime hang/crash leaves the EFI pinning the tail of the log > and something has allowed the log to be overfull (i.e. log space > used plus granted EFI reservation > physical log space), then we may > not have any space at all for any sort of reservation during > recovery. > > I suspect that the cause of this recovery issue is the we require an > overcommit of the log space accounting before we throttle incoming > transaction reservations (i.e. the new reservation has to overrun > before we throttle). I think that the reservation accounting overrun > detection can race to the first item being placed on the wait list, > which might allow transactions that should be throttled to escape > resulting in a temporary log space overcommit. IF we crash at jsut > the wrong time and an intent-in-progress pins the tail of the log, > the overcommit situation can lead to recovery not having enough > space for intent recovery reservations... 
> > This subtle overcommit is one of the issues I have been trying to > correct with the byte-based grant head accounting patches (which I'm > -finally- getting back to). I know that replaying the log from the > metadump that reproduces the hang with the byte-based grant head > accounting patchset allowed log recovery to succeed. It has a > different concept of where the head and tail are and hence how much > log space is actually available at any given time, and that > difference was just enough to allow a tr_itruncate reservation to > succeed. It also has different reservation grant overrun detection > logic, but I'm not 100% sure if it solves this underlying runtime > overcommit problem or not yet... > > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com
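The "overrun before we throttle" behaviour Dave describes above can be sketched with a toy check. This is illustrative only: the function names and numbers are made up, and the real logic lives in the grant head checking code, not here.

```c
#include <assert.h>

/* Toy model: a reservation is throttled only if the log is *already*
 * overrun when it arrives, so the reservation that causes the overrun
 * is itself granted, briefly leaving used > capacity. */
static int toy_try_reserve(long *used, long capacity, long need)
{
    if (*used >= capacity)
        return -1;      /* already overrun: caller sleeps on the wait list */
    *used += need;      /* granted, even if this overshoots capacity */
    return 0;
}

/* with 90 of 100 units used, a request for 30 is still granted
 * (used becomes 120); only the next request is throttled */
static long toy_overrun_demo(void)
{
    long used = 90;
    int first  = toy_try_reserve(&used, 100, 30);   /* granted */
    int second = toy_try_reserve(&used, 100, 10);   /* throttled */

    return (first == 0 && second == -1) ? used : -1;
}
```

If the system crashes in that overcommitted window with an intent pinning the tail of the log, recovery starts with less physical log space than the intent recovery transaction needs, which matches the symptom in this thread.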
* Re: Question: reserve log space at IO time for recover 2023-07-19 0:11 ` Dave Chinner 2023-07-19 1:44 ` Darrick J. Wong @ 2023-07-19 1:46 ` Wengang Wang 1 sibling, 0 replies; 22+ messages in thread From: Wengang Wang @ 2023-07-19 1:46 UTC (permalink / raw) To: Dave Chinner; +Cc: linux-xfs@vger.kernel.org, Srikanth C S > On Jul 18, 2023, at 5:11 PM, Dave Chinner <david@fromorbit.com> wrote: > > On Tue, Jul 18, 2023 at 10:57:38PM +0000, Wengang Wang wrote: >> Hi, >> >> I have a XFS metadump (was running with 4.14.35 plussing some back ported patches), >> mounting it (log recover) hang at log space reservation. There is 181760 bytes on-disk >> free journal space, while the transaction needs to reserve 360416 bytes to start the recovery. >> Thus the mount hangs for ever. > > Most likely something went wrong at runtime on the 4.14.35 kernel > prior to the crash, leaving the on-disk state in an impossible to > recover state. Likely an accounting leak in a transaction > reservation somewhere, likely in passing the space used from the > transaction to the CIL. We've had bugs in this area before, they > eventually manifest in log hangs like this either at runtime or > during recovery... > >> That happens with 4.14.35 kernel and also upstream >> kernel (6.4.0). 
> > Upgrading the kernel won't fix recovery - it is likely that the > journal state on disk is invalid and so the mount cannot complete. > >> Here is the related stack dump (6.4.0 kernel): >> >> [<0>] xlog_grant_head_wait+0xbd/0x200 [xfs] >> [<0>] xlog_grant_head_check+0xd9/0x100 [xfs] >> [<0>] xfs_log_reserve+0xbc/0x1e0 [xfs] >> [<0>] xfs_trans_reserve+0x138/0x170 [xfs] >> [<0>] xfs_trans_alloc+0xe8/0x220 [xfs] >> [<0>] xfs_efi_item_recover+0x110/0x250 [xfs] >> [<0>] xlog_recover_process_intents.isra.28+0xba/0x2d0 [xfs] >> [<0>] xlog_recover_finish+0x33/0x310 [xfs] >> [<0>] xfs_log_mount_finish+0xdb/0x160 [xfs] >> [<0>] xfs_mountfs+0x51c/0x900 [xfs] >> [<0>] xfs_fs_fill_super+0x4b8/0x940 [xfs] >> [<0>] get_tree_bdev+0x193/0x280 >> [<0>] vfs_get_tree+0x26/0xd0 >> [<0>] path_mount+0x69d/0x9b0 >> [<0>] do_mount+0x7d/0xa0 >> [<0>] __x64_sys_mount+0xdc/0x100 >> [<0>] do_syscall_64+0x3b/0x90 >> [<0>] entry_SYSCALL_64_after_hwframe+0x6e/0xd8 >> >> Thus we can say 4.14.35 kernel didn’t reserve log space at IO time to make log recover >> safe. Upstream kernel doesn’t do that either if I read the source code right (I might be wrong). > > Sure they do. > > Log space usage is what the grant heads track; transactions are not > allowed to start if there isn't both reserve and write grant head > space available for them, and transaction rolls get held until there > is write grant space available for them (i.e. they can block in > xfs_trans_roll() -> xfs_trans_reserve() waiting for write grant head > space). > > There have been bugs in the grant head accounting mechanisms in the > past, there may well still be bugs in them. But it is the grant head > mechanism that is supposed to guarantee there is always space in > the journal for a transaction to commit, and by extension, ensure > that we always have space in the journal for a transaction to be > fully recovered. 
> >> So shall we reserve proper amount of log space at IO time, call it Unflush-Reserve, to >> ensure log recovery safe? The number of UR is determined by current un flushed log items. >> It gets increased just after transaction is committed and gets decreased when log items are >> flushed. With the UR, we are safe to have enough log space for the transactions used by log >> recovery. > > The grant heads already track log space usage and reservations like > this. If you want to learn more about the nitty gritty details, look > at this patch set that is aimed at changing how the grant heads > track the used/reserved log space to improve performance: > > https://lore.kernel.org/linux-xfs/20221220232308.3482960-1-david@fromorbit.com/ Thanks Dave a lot! I will look more into the write grant head and the above patch set. Have a good day, Wengang
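Dave's point that a transaction must have space under both grant heads before it may start can be sketched like this. It is a toy model with made-up names, not the real grant head check, and it plugs in the numbers from the original report.

```c
#include <assert.h>

/* Toy model: a new transaction must fit under both the reserve and
 * the write grant head; otherwise it sleeps until space is freed. */
static int toy_log_reserve(long reserve_free, long write_free, long need)
{
    if (reserve_free < need || write_free < need)
        return -1;  /* the real code would sleep waiting for space */
    return 0;
}

/* the numbers from the report: 181760 bytes of free journal space
 * on disk, 360416 bytes needed by the EFI recovery transaction */
static int toy_recovery_hang(void)
{
    return toy_log_reserve(181760, 181760, 360416);
}
```

During log recovery nothing else is running to free journal space, so the "sleep" case never resolves, which is exactly the mount hang shown in the stack trace earlier in the thread.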
end of thread, other threads: [~2023-08-27 16:05 UTC | newest]

Thread overview: 22+ messages
2023-07-18 22:57 Question: reserve log space at IO time for recover Wengang Wang
2023-07-19 0:11 ` Dave Chinner
2023-07-19 1:44 ` Darrick J. Wong
2023-07-19 6:25 ` Dave Chinner
2023-07-21 19:36 ` Wengang Wang
2023-07-24 0:57 ` Dave Chinner
2023-07-24 18:03 ` Wengang Wang
2023-07-26 4:08 ` Dave Chinner
2023-07-26 15:23 ` Darrick J. Wong
2023-07-27 1:05 ` Dave Chinner
2023-07-28 17:56 ` Wengang Wang
2023-08-18 3:25 ` Wengang Wang
2023-08-21 22:06 ` Wengang Wang
2023-08-24 5:05 ` Darrick J. Wong
2023-08-24 22:55 ` Wengang Wang
2023-08-24 4:52 ` Darrick J. Wong
2023-08-24 7:28 ` Dave Chinner
2023-08-24 22:01 ` Darrick J. Wong
2023-08-26 3:37 ` Dave Chinner
2023-08-27 16:04 ` Darrick J. Wong
2023-08-24 23:53 ` Wengang Wang
2023-07-19 1:46 ` Wengang Wang