[PATCH 0/3] xfs: miscellaneous bug fixes

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/3] xfs: miscellaneous bug fixes
@ 2024-11-12 22:05 Dave Chinner
  2024-11-12 22:05 ` [PATCH 1/3] xfs: fix sparse inode limits on runt AG Dave Chinner
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Dave Chinner @ 2024-11-12 22:05 UTC (permalink / raw)
  To: linux-xfs; +Cc: cem

These are three bug fixes for recent issues.

The first is a repost of the original patch to prevent allocation of
sparse inode clusters at the end of an unaligned runt AG. There
was plenty of discussion over that fix here:

https://lore.kernel.org/linux-xfs/20241024025142.4082218-1-david@fromorbit.com/

And the outcome of that discussion is that we can't allow sparse
inode clusters overlapping the end of the runt AG without an on disk
format definition change. Hence this patch to ensure the check is
done correctly is the only change we need to make to the kernel to
avoid this problem in the future.

Filesystems that have this problem on disk will need to run
xfs_repair to remove the bad cluster, but no data loss is possible
from this because the kernel currently disallows inode allocation
from the bad cluster and so none of the inodes in the sparse cluster
can actually be used. Hence there is no possible data loss or other
metadata corruption possible from this situation, all we need to do
is ensure that it doesn't happen again once repair has done it's
work.

The other two patches are for issues I've recently hit when running
lots of fstests in parallel. That changes loading and hence timing
of events during tests, exposing latent race conditions in the code.
The quota fix removes racy debug code that has been there since the
quota code was first committed in 1996.

The log shutdown race fix is a much more recent issue created by
trying to ensure shutdowns operate in a sane and predictable manner.
The logic flaw is that we allow multiple log shutdowns to start and
force the log before selecting on a single log shutdown task. This
leads to a situation where shutdown log item callback processing
gets stuck waiting on a task holding a buffer lock that is waiting
on a log force that is waiting on shutdown log item callback
processing to complete...

Thoughts?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/3] xfs: fix sparse inode limits on runt AG
  2024-11-12 22:05 [PATCH 0/3] xfs: miscellaneous bug fixes Dave Chinner
@ 2024-11-12 22:05 ` Dave Chinner
  2024-11-12 23:15   ` Darrick J. Wong
  2024-11-12 22:05 ` [PATCH 2/3] xfs: delalloc and quota softlimit timers are incoherent Dave Chinner
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2024-11-12 22:05 UTC (permalink / raw)
  To: linux-xfs; +Cc: cem

From: Dave Chinner <dchinner@redhat.com>

The runt AG at the end of a filesystem is almost always smaller than
the mp->m_sb.sb_agblocks. Unfortunately, when setting the max_agbno
limit for the inode chunk allocation, we do not take this into
account. This means we can allocate a sparse inode chunk that
overlaps beyond the end of an AG. When we go to allocate an inode
from that sparse chunk, the irec fails validation because the
agbno of the start of the irec is beyond valid limits for the runt
AG.

Prevent this from happening by taking into account the size of the
runt AG when allocating inode chunks. Also convert the various
checks for valid inode chunk agbnos to use xfs_ag_block_count()
so that they will also catch such issues in the future.

Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 271855227514..6258527315f2 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -855,7 +855,8 @@ xfs_ialloc_ag_alloc(
 		 * the end of the AG.
 		 */
 		args.min_agbno = args.mp->m_sb.sb_inoalignmt;
-		args.max_agbno = round_down(args.mp->m_sb.sb_agblocks,
+		args.max_agbno = round_down(xfs_ag_block_count(args.mp,
+							pag->pag_agno),
 					    args.mp->m_sb.sb_inoalignmt) -
 				 igeo->ialloc_blks;
 
@@ -2332,9 +2333,9 @@ xfs_difree(
 		return -EINVAL;
 	}
 	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
-	if (agbno >= mp->m_sb.sb_agblocks)  {
-		xfs_warn(mp, "%s: agbno >= mp->m_sb.sb_agblocks (%d >= %d).",
-			__func__, agbno, mp->m_sb.sb_agblocks);
+	if (agbno >= xfs_ag_block_count(mp, pag->pag_agno)) {
+		xfs_warn(mp, "%s: agbno >= xfs_ag_block_count (%d >= %d).",
+			__func__, agbno, xfs_ag_block_count(mp, pag->pag_agno));
 		ASSERT(0);
 		return -EINVAL;
 	}
@@ -2457,7 +2458,7 @@ xfs_imap(
 	 */
 	agino = XFS_INO_TO_AGINO(mp, ino);
 	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
-	if (agbno >= mp->m_sb.sb_agblocks ||
+	if (agbno >= xfs_ag_block_count(mp, pag->pag_agno) ||
 	    ino != XFS_AGINO_TO_INO(mp, pag->pag_agno, agino)) {
 		error = -EINVAL;
 #ifdef DEBUG
@@ -2467,11 +2468,12 @@ xfs_imap(
 		 */
 		if (flags & XFS_IGET_UNTRUSTED)
 			return error;
-		if (agbno >= mp->m_sb.sb_agblocks) {
+		if (agbno >= xfs_ag_block_count(mp, pag->pag_agno)) {
 			xfs_alert(mp,
 		"%s: agbno (0x%llx) >= mp->m_sb.sb_agblocks (0x%lx)",
 				__func__, (unsigned long long)agbno,
-				(unsigned long)mp->m_sb.sb_agblocks);
+				(unsigned long)xfs_ag_block_count(mp,
+							pag->pag_agno));
 		}
 		if (ino != XFS_AGINO_TO_INO(mp, pag->pag_agno, agino)) {
 			xfs_alert(mp,
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] xfs: fix sparse inode limits on runt AG
  2024-11-12 22:05 ` [PATCH 1/3] xfs: fix sparse inode limits on runt AG Dave Chinner
@ 2024-11-12 23:15   ` Darrick J. Wong
  2024-11-13  0:12     ` Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2024-11-12 23:15 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, cem

On Wed, Nov 13, 2024 at 09:05:14AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> The runt AG at the end of a filesystem is almost always smaller than
> the mp->m_sb.sb_agblocks. Unfortunately, when setting the max_agbno
> limit for the inode chunk allocation, we do not take this into
> account. This means we can allocate a sparse inode chunk that
> overlaps beyond the end of an AG. When we go to allocate an inode
> from that sparse chunk, the irec fails validation because the
> agbno of the start of the irec is beyond valid limits for the runt
> AG.
> 
> Prevent this from happening by taking into account the size of the
> runt AG when allocating inode chunks. Also convert the various
> checks for valid inode chunk agbnos to use xfs_ag_block_count()
> so that they will also catch such issues in the future.
> 
> Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure")

Cc: <stable@vger.kernel.org> # v4.2

> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/libxfs/xfs_ialloc.c | 16 +++++++++-------
>  1 file changed, 9 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> index 271855227514..6258527315f2 100644
> --- a/fs/xfs/libxfs/xfs_ialloc.c
> +++ b/fs/xfs/libxfs/xfs_ialloc.c
> @@ -855,7 +855,8 @@ xfs_ialloc_ag_alloc(
>  		 * the end of the AG.
>  		 */
>  		args.min_agbno = args.mp->m_sb.sb_inoalignmt;
> -		args.max_agbno = round_down(args.mp->m_sb.sb_agblocks,
> +		args.max_agbno = round_down(xfs_ag_block_count(args.mp,
> +							pag->pag_agno),

So if I'm reading this right, the inode cluster allocation checks now
enforce that we cannot search for free space beyond the actual end of
the AG, rounded down per inode alignment rules?

In that case, can this use the cached ag block count:

		args.max_agbno = round_down(
					pag_group(pag)->xg_block_count,
					args.mp->m_sb.sb_inoalignmt);

rather than recomputing the block count every time?

>  					    args.mp->m_sb.sb_inoalignmt) -
>  				 igeo->ialloc_blks;
>  
> @@ -2332,9 +2333,9 @@ xfs_difree(
>  		return -EINVAL;
>  	}
>  	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
> -	if (agbno >= mp->m_sb.sb_agblocks)  {
> -		xfs_warn(mp, "%s: agbno >= mp->m_sb.sb_agblocks (%d >= %d).",
> -			__func__, agbno, mp->m_sb.sb_agblocks);
> +	if (agbno >= xfs_ag_block_count(mp, pag->pag_agno)) {
> +		xfs_warn(mp, "%s: agbno >= xfs_ag_block_count (%d >= %d).",
> +			__func__, agbno, xfs_ag_block_count(mp, pag->pag_agno));

and everywhere else too?

(The logic itself looks ok)

--D

>  		ASSERT(0);
>  		return -EINVAL;
>  	}
> @@ -2457,7 +2458,7 @@ xfs_imap(
>  	 */
>  	agino = XFS_INO_TO_AGINO(mp, ino);
>  	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
> -	if (agbno >= mp->m_sb.sb_agblocks ||
> +	if (agbno >= xfs_ag_block_count(mp, pag->pag_agno) ||
>  	    ino != XFS_AGINO_TO_INO(mp, pag->pag_agno, agino)) {
>  		error = -EINVAL;
>  #ifdef DEBUG
> @@ -2467,11 +2468,12 @@ xfs_imap(
>  		 */
>  		if (flags & XFS_IGET_UNTRUSTED)
>  			return error;
> -		if (agbno >= mp->m_sb.sb_agblocks) {
> +		if (agbno >= xfs_ag_block_count(mp, pag->pag_agno)) {
>  			xfs_alert(mp,
>  		"%s: agbno (0x%llx) >= mp->m_sb.sb_agblocks (0x%lx)",
>  				__func__, (unsigned long long)agbno,
> -				(unsigned long)mp->m_sb.sb_agblocks);
> +				(unsigned long)xfs_ag_block_count(mp,
> +							pag->pag_agno));
>  		}
>  		if (ino != XFS_AGINO_TO_INO(mp, pag->pag_agno, agino)) {
>  			xfs_alert(mp,
> -- 
> 2.45.2
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] xfs: fix sparse inode limits on runt AG
  2024-11-12 23:15   ` Darrick J. Wong
@ 2024-11-13  0:12     ` Dave Chinner
  2024-11-13  0:24       ` Darrick J. Wong
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2024-11-13  0:12 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, cem

On Tue, Nov 12, 2024 at 03:15:39PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 13, 2024 at 09:05:14AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > The runt AG at the end of a filesystem is almost always smaller than
> > the mp->m_sb.sb_agblocks. Unfortunately, when setting the max_agbno
> > limit for the inode chunk allocation, we do not take this into
> > account. This means we can allocate a sparse inode chunk that
> > overlaps beyond the end of an AG. When we go to allocate an inode
> > from that sparse chunk, the irec fails validation because the
> > agbno of the start of the irec is beyond valid limits for the runt
> > AG.
> > 
> > Prevent this from happening by taking into account the size of the
> > runt AG when allocating inode chunks. Also convert the various
> > checks for valid inode chunk agbnos to use xfs_ag_block_count()
> > so that they will also catch such issues in the future.
> > 
> > Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure")
> 
> Cc: <stable@vger.kernel.org> # v4.2
> 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > ---
> >  fs/xfs/libxfs/xfs_ialloc.c | 16 +++++++++-------
> >  1 file changed, 9 insertions(+), 7 deletions(-)
> > 
> > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > index 271855227514..6258527315f2 100644
> > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > @@ -855,7 +855,8 @@ xfs_ialloc_ag_alloc(
> >  		 * the end of the AG.
> >  		 */
> >  		args.min_agbno = args.mp->m_sb.sb_inoalignmt;
> > -		args.max_agbno = round_down(args.mp->m_sb.sb_agblocks,
> > +		args.max_agbno = round_down(xfs_ag_block_count(args.mp,
> > +							pag->pag_agno),
> 
> So if I'm reading this right, the inode cluster allocation checks now
> enforce that we cannot search for free space beyond the actual end of
> the AG, rounded down per inode alignment rules?
> 
> In that case, can this use the cached ag block count:
> 
> 		args.max_agbno = round_down(
> 					pag_group(pag)->xg_block_count,
> 					args.mp->m_sb.sb_inoalignmt);
> 
> rather than recomputing the block count every time?

Eventually, yes. I have another series that pushes the pag further
into these AG size checks all over the code to try to avoid this
entire class of bug in the future (i.e. blindly using mp->m_sb ag
parameters without considering the last AG is a runt).

I am waiting for the group rework to land
before I did any more work on that conversion. However, it is not
yet in the for-next branch, so I'm assuming that it isn't due to be
merged in the upcoming merge window because that's about to open
and none of that code has seen any time in linux-next of fs-next...

I want this fix to land sooner rather than later, so I used
xfs_ag_block_count() to avoid conflicts with pending work as well
as not require me to chase random dev branches to submit what is a
relatively simple bug fix....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] xfs: fix sparse inode limits on runt AG
  2024-11-13  0:12     ` Dave Chinner
@ 2024-11-13  0:24       ` Darrick J. Wong
  0 siblings, 0 replies; 17+ messages in thread
From: Darrick J. Wong @ 2024-11-13  0:24 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, cem

On Wed, Nov 13, 2024 at 11:12:02AM +1100, Dave Chinner wrote:
> On Tue, Nov 12, 2024 at 03:15:39PM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 13, 2024 at 09:05:14AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > The runt AG at the end of a filesystem is almost always smaller than
> > > the mp->m_sb.sb_agblocks. Unfortunately, when setting the max_agbno
> > > limit for the inode chunk allocation, we do not take this into
> > > account. This means we can allocate a sparse inode chunk that
> > > overlaps beyond the end of an AG. When we go to allocate an inode
> > > from that sparse chunk, the irec fails validation because the
> > > agbno of the start of the irec is beyond valid limits for the runt
> > > AG.
> > > 
> > > Prevent this from happening by taking into account the size of the
> > > runt AG when allocating inode chunks. Also convert the various
> > > checks for valid inode chunk agbnos to use xfs_ag_block_count()
> > > so that they will also catch such issues in the future.
> > > 
> > > Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure")
> > 
> > Cc: <stable@vger.kernel.org> # v4.2
> > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_ialloc.c | 16 +++++++++-------
> > >  1 file changed, 9 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
> > > index 271855227514..6258527315f2 100644
> > > --- a/fs/xfs/libxfs/xfs_ialloc.c
> > > +++ b/fs/xfs/libxfs/xfs_ialloc.c
> > > @@ -855,7 +855,8 @@ xfs_ialloc_ag_alloc(
> > >  		 * the end of the AG.
> > >  		 */
> > >  		args.min_agbno = args.mp->m_sb.sb_inoalignmt;
> > > -		args.max_agbno = round_down(args.mp->m_sb.sb_agblocks,
> > > +		args.max_agbno = round_down(xfs_ag_block_count(args.mp,
> > > +							pag->pag_agno),
> > 
> > So if I'm reading this right, the inode cluster allocation checks now
> > enforce that we cannot search for free space beyond the actual end of
> > the AG, rounded down per inode alignment rules?
> > 
> > In that case, can this use the cached ag block count:
> > 
> > 		args.max_agbno = round_down(
> > 					pag_group(pag)->xg_block_count,
> > 					args.mp->m_sb.sb_inoalignmt);
> > 
> > rather than recomputing the block count every time?
> 
> Eventually, yes. I have another series that pushes the pag further
> into these AG size checks all over the code to try to avoid this
> entire class of bug in the future (i.e. blindly using mp->m_sb ag
> parameters without considering the last AG is a runt).
> 
> I am waiting for the group rework to land
> before I did any more work on that conversion. However, it is not
> yet in the for-next branch, so I'm assuming that it isn't due to be
> merged in the upcoming merge window because that's about to open
> and none of that code has seen any time in linux-next of fs-next...

...let's hope that slip doesn't happen. :(

> I want this fix to land sooner rather than later, so I used
> xfs_ag_block_count() to avoid conflicts with pending work as well
> as not require me to chase random dev branches to submit what is a
> relatively simple bug fix....

Aha, I wondered if that was why you were asking if cem was planning to
push things to for-next.  He said during office hours that he'd push the
metadir/rtgroups stuff Wednesday morning.

With the xfs_ag_block_count calls replaced if the generic groups rework
*does* land (or as it is now if it doesn't),

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 2/3] xfs: delalloc and quota softlimit timers are incoherent
  2024-11-12 22:05 [PATCH 0/3] xfs: miscellaneous bug fixes Dave Chinner
  2024-11-12 22:05 ` [PATCH 1/3] xfs: fix sparse inode limits on runt AG Dave Chinner
@ 2024-11-12 22:05 ` Dave Chinner
  2024-11-12 23:48   ` Darrick J. Wong
  2024-11-13  8:48   ` Christoph Hellwig
  2024-11-12 22:05 ` [PATCH 3/3] xfs: prevent mount and log shutdown race Dave Chinner
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 17+ messages in thread
From: Dave Chinner @ 2024-11-12 22:05 UTC (permalink / raw)
  To: linux-xfs; +Cc: cem

From: Supriya Wickrematillake <sup@sgi.com>

I've been seeing this failure on during xfs/050 recently:

 XFS: Assertion failed: dst->d_spc_timer != 0, file: fs/xfs/xfs_qm_syscalls.c, line: 435
....
 Call Trace:
  <TASK>
  xfs_qm_scall_getquota_fill_qc+0x2a2/0x2b0
  xfs_qm_scall_getquota_next+0x69/0xa0
  xfs_fs_get_nextdqblk+0x62/0xf0
  quota_getnextxquota+0xbf/0x320
  do_quotactl+0x1a1/0x410
  __se_sys_quotactl+0x126/0x310
  __x64_sys_quotactl+0x21/0x30
  x64_sys_call+0x2819/0x2ee0
  do_syscall_64+0x68/0x130
  entry_SYSCALL_64_after_hwframe+0x76/0x7e

It turns out that the _qmount call has silently been failing to
unmount and mount the filesystem, so when the softlimit is pushed
past with a buffered write, it is not getting synced to disk before
the next quota report is being run.

Hence when the quota report runs, we have 300 blocks of delalloc
data on an inode, with a soft limit of 200 blocks. XFS dquots
account delalloc reservations as used space, hence the dquot is over
the soft limit.

However, we don't update the soft limit timers until we do a
transactional update of the dquot. That is, the dquot sits over the
soft limit without a softlimit timer being started until writeback
occurs and the allocation modifies the dquot and we call
xfs_qm_adjust_dqtimers() from xfs_trans_apply_dquot_deltas() in
xfs_trans_commit() context.

This isn't really a problem, except for this debug code in
xfs_qm_scall_getquota_fill_qc():

#ifdef DEBUG
        if (xfs_dquot_is_enforced(dqp) && dqp->q_id != 0) {
                if ((dst->d_space > dst->d_spc_softlimit) &&
                    (dst->d_spc_softlimit > 0)) {
                        ASSERT(dst->d_spc_timer != 0);
                }
....

It asserts taht if the used block count is over the soft limit,
it *must* have a soft limit timer running. This is clearly not
the case, because we haven't committed the delalloc space to disk
yet. Hence the soft limit is only exceeded temporarily in memory
(which isn't an issue) and we start the timer the moment we exceed
the soft limit in journalled metadata.

This debug was introduced in:

commit 0d5ad8383061fbc0a9804fbb98218750000fe032
Author: Supriya Wickrematillake <sup@sgi.com>
Date:   Wed May 15 22:44:44 1996 +0000

    initial checkin
    quotactl syscall functions.

The very first quota support commit back in 1996. This is zero-day
debug for Irix and, as it turns out, a zero-day bug in the debug
code because the delalloc code on Irix didn't update the softlimit
timers, either.

IOWs, this issue has been in the code for 28 years.

We obviously don't care if soft limit timers are a bit rubbery when
we have delalloc reservations in memory. Production systems running
quota reports have been exposed to this situation for 28 years and
nobody has noticed it, so the debug code is essentially worthless at
this point in time.

We also have the on-disk dquot verifiers checking that the soft
limit timer is running whenever the dquot is over the soft limit
before we write it to disk and after we read it from disk. These
aren't firing, so it is clear the issue is purely a temporary
in-memory incoherency that I never would have noticed had the test
not silently failed to unmount the filesystem.

Hence I'm simply going to trash this runtime debug because it isn't
useful in the slightest for catching quota bugs.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_qm_syscalls.c | 13 -------------
 1 file changed, 13 deletions(-)

diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index 4eda50ae2d1c..0c78f30fa4a3 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -427,19 +427,6 @@ xfs_qm_scall_getquota_fill_qc(
 		dst->d_ino_timer = 0;
 		dst->d_rt_spc_timer = 0;
 	}
-
-#ifdef DEBUG
-	if (xfs_dquot_is_enforced(dqp) && dqp->q_id != 0) {
-		if ((dst->d_space > dst->d_spc_softlimit) &&
-		    (dst->d_spc_softlimit > 0)) {
-			ASSERT(dst->d_spc_timer != 0);
-		}
-		if ((dst->d_ino_count > dqp->q_ino.softlimit) &&
-		    (dqp->q_ino.softlimit > 0)) {
-			ASSERT(dst->d_ino_timer != 0);
-		}
-	}
-#endif
 }
 
 /* Return the quota information for the dquot matching id. */
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] xfs: delalloc and quota softlimit timers are incoherent
  2024-11-12 22:05 ` [PATCH 2/3] xfs: delalloc and quota softlimit timers are incoherent Dave Chinner
@ 2024-11-12 23:48   ` Darrick J. Wong
  2024-11-13  0:14     ` Dave Chinner
  2024-11-13  8:48   ` Christoph Hellwig
  1 sibling, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2024-11-12 23:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, cem

On Wed, Nov 13, 2024 at 09:05:15AM +1100, Dave Chinner wrote:
> From: Supriya Wickrematillake <sup@sgi.com>

Wow, there are still people working at SGI??

> I've been seeing this failure on during xfs/050 recently:
> 
>  XFS: Assertion failed: dst->d_spc_timer != 0, file: fs/xfs/xfs_qm_syscalls.c, line: 435
> ....
>  Call Trace:
>   <TASK>
>   xfs_qm_scall_getquota_fill_qc+0x2a2/0x2b0
>   xfs_qm_scall_getquota_next+0x69/0xa0
>   xfs_fs_get_nextdqblk+0x62/0xf0
>   quota_getnextxquota+0xbf/0x320
>   do_quotactl+0x1a1/0x410
>   __se_sys_quotactl+0x126/0x310
>   __x64_sys_quotactl+0x21/0x30
>   x64_sys_call+0x2819/0x2ee0
>   do_syscall_64+0x68/0x130
>   entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> It turns out that the _qmount call has silently been failing to
> unmount and mount the filesystem, so when the softlimit is pushed
> past with a buffered write, it is not getting synced to disk before
> the next quota report is being run.
> 
> Hence when the quota report runs, we have 300 blocks of delalloc
> data on an inode, with a soft limit of 200 blocks. XFS dquots
> account delalloc reservations as used space, hence the dquot is over
> the soft limit.
> 
> However, we don't update the soft limit timers until we do a
> transactional update of the dquot. That is, the dquot sits over the
> soft limit without a softlimit timer being started until writeback
> occurs and the allocation modifies the dquot and we call
> xfs_qm_adjust_dqtimers() from xfs_trans_apply_dquot_deltas() in
> xfs_trans_commit() context.
> 
> This isn't really a problem, except for this debug code in
> xfs_qm_scall_getquota_fill_qc():
> 
> #ifdef DEBUG
>         if (xfs_dquot_is_enforced(dqp) && dqp->q_id != 0) {
>                 if ((dst->d_space > dst->d_spc_softlimit) &&
>                     (dst->d_spc_softlimit > 0)) {
>                         ASSERT(dst->d_spc_timer != 0);
>                 }
> ....
> 
> It asserts taht if the used block count is over the soft limit,
> it *must* have a soft limit timer running. This is clearly not
> the case, because we haven't committed the delalloc space to disk
> yet. Hence the soft limit is only exceeded temporarily in memory
> (which isn't an issue) and we start the timer the moment we exceed
> the soft limit in journalled metadata.
> 
> This debug was introduced in:
> 
> commit 0d5ad8383061fbc0a9804fbb98218750000fe032
> Author: Supriya Wickrematillake <sup@sgi.com>
> Date:   Wed May 15 22:44:44 1996 +0000
> 
>     initial checkin
>     quotactl syscall functions.
> 
> The very first quota support commit back in 1996. This is zero-day
> debug for Irix and, as it turns out, a zero-day bug in the debug
> code because the delalloc code on Irix didn't update the softlimit
> timers, either.
> 
> IOWs, this issue has been in the code for 28 years.
> 
> We obviously don't care if soft limit timers are a bit rubbery when
> we have delalloc reservations in memory. Production systems running
> quota reports have been exposed to this situation for 28 years and
> nobody has noticed it, so the debug code is essentially worthless at
> this point in time.
> 
> We also have the on-disk dquot verifiers checking that the soft
> limit timer is running whenever the dquot is over the soft limit
> before we write it to disk and after we read it from disk. These
> aren't firing, so it is clear the issue is purely a temporary
> in-memory incoherency that I never would have noticed had the test
> not silently failed to unmount the filesystem.
> 
> Hence I'm simply going to trash this runtime debug because it isn't
> useful in the slightest for catching quota bugs.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Agreed!  I've hit this once in a blue moon and didn't think it was
especially useful either.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  fs/xfs/xfs_qm_syscalls.c | 13 -------------
>  1 file changed, 13 deletions(-)
> 
> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> index 4eda50ae2d1c..0c78f30fa4a3 100644
> --- a/fs/xfs/xfs_qm_syscalls.c
> +++ b/fs/xfs/xfs_qm_syscalls.c
> @@ -427,19 +427,6 @@ xfs_qm_scall_getquota_fill_qc(
>  		dst->d_ino_timer = 0;
>  		dst->d_rt_spc_timer = 0;
>  	}
> -
> -#ifdef DEBUG
> -	if (xfs_dquot_is_enforced(dqp) && dqp->q_id != 0) {
> -		if ((dst->d_space > dst->d_spc_softlimit) &&
> -		    (dst->d_spc_softlimit > 0)) {
> -			ASSERT(dst->d_spc_timer != 0);
> -		}
> -		if ((dst->d_ino_count > dqp->q_ino.softlimit) &&
> -		    (dqp->q_ino.softlimit > 0)) {
> -			ASSERT(dst->d_ino_timer != 0);
> -		}
> -	}
> -#endif
>  }
>  
>  /* Return the quota information for the dquot matching id. */
> -- 
> 2.45.2
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] xfs: delalloc and quota softlimit timers are incoherent
  2024-11-12 23:48   ` Darrick J. Wong
@ 2024-11-13  0:14     ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2024-11-13  0:14 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, cem

On Tue, Nov 12, 2024 at 03:48:35PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 13, 2024 at 09:05:15AM +1100, Dave Chinner wrote:
> > From: Supriya Wickrematillake <sup@sgi.com>
> 
> Wow, there are still people working at SGI??

huh, that's a git-send-email screwup. It must have pulled that from
the quote of the initial quota commit:

> > This debug was introduced in:
> > 
> > commit 0d5ad8383061fbc0a9804fbb98218750000fe032
> > Author: Supriya Wickrematillake <sup@sgi.com>
> > Date:   Wed May 15 22:44:44 1996 +0000
> > 
> >     initial checkin
> >     quotactl syscall functions.

Here, and decided to ignore the actual "From: dchinner@redhat" tag
in the local commit message.

Great work, git!


> > The very first quota support commit back in 1996. This is zero-day
> > debug for Irix and, as it turns out, a zero-day bug in the debug
> > code because the delalloc code on Irix didn't update the softlimit
> > timers, either.
> > 
> > IOWs, this issue has been in the code for 28 years.
> > 
> > We obviously don't care if soft limit timers are a bit rubbery when
> > we have delalloc reservations in memory. Production systems running
> > quota reports have been exposed to this situation for 28 years and
> > nobody has noticed it, so the debug code is essentially worthless at
> > this point in time.
> > 
> > We also have the on-disk dquot verifiers checking that the soft
> > limit timer is running whenever the dquot is over the soft limit
> > before we write it to disk and after we read it from disk. These
> > aren't firing, so it is clear the issue is purely a temporary
> > in-memory incoherency that I never would have noticed had the test
> > not silently failed to unmount the filesystem.
> > 
> > Hence I'm simply going to trash this runtime debug because it isn't
> > useful in the slightest for catching quota bugs.
> > 
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> 
> Agreed!  I've hit this once in a blue moon and didn't think it was
> especially useful either.
> 
> Reviewed-by: Darrick J. Wong <djwong@kernel.org>

Thanks!

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] xfs: delalloc and quota softlimit timers are incoherent
  2024-11-12 22:05 ` [PATCH 2/3] xfs: delalloc and quota softlimit timers are incoherent Dave Chinner
  2024-11-12 23:48   ` Darrick J. Wong
@ 2024-11-13  8:48   ` Christoph Hellwig
  1 sibling, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-11-13  8:48 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, cem

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 3/3] xfs: prevent mount and log shutdown race
  2024-11-12 22:05 [PATCH 0/3] xfs: miscellaneous bug fixes Dave Chinner
  2024-11-12 22:05 ` [PATCH 1/3] xfs: fix sparse inode limits on runt AG Dave Chinner
  2024-11-12 22:05 ` [PATCH 2/3] xfs: delalloc and quota softlimit timers are incoherent Dave Chinner
@ 2024-11-12 22:05 ` Dave Chinner
  2024-11-12 23:58   ` Darrick J. Wong
  2024-11-13  8:50   ` Christoph Hellwig
  2024-11-12 23:59 ` [PATCH 0/3] xfs: miscellaneous bug fixes Darrick J. Wong
  2024-11-25 11:57 ` Carlos Maiolino
  4 siblings, 2 replies; 17+ messages in thread
From: Dave Chinner @ 2024-11-12 22:05 UTC (permalink / raw)
  To: linux-xfs; +Cc: cem

From: Dave Chinner <dchinner@redhat.com>

I recently had an fstests hang where there were two internal tasks
stuck like so:

[ 6559.010870] task:kworker/24:45   state:D stack:12152 pid:631308 tgid:631308 ppid:2      flags:0x00004000
[ 6559.016984] Workqueue: xfs-buf/dm-2 xfs_buf_ioend_work
[ 6559.020349] Call Trace:
[ 6559.022002]  <TASK>
[ 6559.023426]  __schedule+0x650/0xb10
[ 6559.025734]  schedule+0x6d/0xf0
[ 6559.027835]  schedule_timeout+0x31/0x180
[ 6559.030582]  wait_for_common+0x10c/0x1e0
[ 6559.033495]  wait_for_completion+0x1d/0x30
[ 6559.036463]  __flush_workqueue+0xeb/0x490
[ 6559.039479]  ? mempool_alloc_slab+0x15/0x20
[ 6559.042537]  xlog_cil_force_seq+0xa1/0x2f0
[ 6559.045498]  ? bio_alloc_bioset+0x1d8/0x510
[ 6559.048578]  ? submit_bio_noacct+0x2f2/0x380
[ 6559.051665]  ? xlog_force_shutdown+0x3b/0x170
[ 6559.054819]  xfs_log_force+0x77/0x230
[ 6559.057455]  xlog_force_shutdown+0x3b/0x170
[ 6559.060507]  xfs_do_force_shutdown+0xd4/0x200
[ 6559.063798]  ? xfs_buf_rele+0x1bd/0x580
[ 6559.066541]  xfs_buf_ioend_handle_error+0x163/0x2e0
[ 6559.070099]  xfs_buf_ioend+0x61/0x200
[ 6559.072728]  xfs_buf_ioend_work+0x15/0x20
[ 6559.075706]  process_scheduled_works+0x1d4/0x400
[ 6559.078814]  worker_thread+0x234/0x2e0
[ 6559.081300]  kthread+0x147/0x170
[ 6559.083462]  ? __pfx_worker_thread+0x10/0x10
[ 6559.086295]  ? __pfx_kthread+0x10/0x10
[ 6559.088771]  ret_from_fork+0x3e/0x50
[ 6559.091153]  ? __pfx_kthread+0x10/0x10
[ 6559.093624]  ret_from_fork_asm+0x1a/0x30
[ 6559.096227]  </TASK>

[ 6559.109304] Workqueue: xfs-cil/dm-2 xlog_cil_push_work
[ 6559.112673] Call Trace:
[ 6559.114333]  <TASK>
[ 6559.115760]  __schedule+0x650/0xb10
[ 6559.118084]  schedule+0x6d/0xf0
[ 6559.120175]  schedule_timeout+0x31/0x180
[ 6559.122776]  ? call_rcu+0xee/0x2f0
[ 6559.125034]  __down_common+0xbe/0x1f0
[ 6559.127470]  __down+0x1d/0x30
[ 6559.129458]  down+0x48/0x50
[ 6559.131343]  ? xfs_buf_item_unpin+0x8d/0x380
[ 6559.134213]  xfs_buf_lock+0x3d/0xe0
[ 6559.136544]  xfs_buf_item_unpin+0x8d/0x380
[ 6559.139253]  xlog_cil_committed+0x287/0x520
[ 6559.142019]  ? sched_clock+0x10/0x30
[ 6559.144384]  ? sched_clock_cpu+0x10/0x190
[ 6559.147039]  ? psi_group_change+0x48/0x310
[ 6559.149735]  ? _raw_spin_unlock+0xe/0x30
[ 6559.152340]  ? finish_task_switch+0xbc/0x310
[ 6559.155163]  xlog_cil_process_committed+0x6d/0x90
[ 6559.158265]  xlog_state_shutdown_callbacks+0x53/0x110
[ 6559.161564]  ? xlog_cil_push_work+0xa70/0xaf0
[ 6559.164441]  xlog_state_release_iclog+0xba/0x1b0
[ 6559.167483]  xlog_cil_push_work+0xa70/0xaf0
[ 6559.170260]  process_scheduled_works+0x1d4/0x400
[ 6559.173286]  worker_thread+0x234/0x2e0
[ 6559.175779]  kthread+0x147/0x170
[ 6559.177933]  ? __pfx_worker_thread+0x10/0x10
[ 6559.180748]  ? __pfx_kthread+0x10/0x10
[ 6559.183231]  ret_from_fork+0x3e/0x50
[ 6559.185601]  ? __pfx_kthread+0x10/0x10
[ 6559.188092]  ret_from_fork_asm+0x1a/0x30
[ 6559.190692]  </TASK>

This is an ABBA deadlock where buffer IO completion is triggering a
forced shutdown with the buffer lock held. It is waiting for the CIL
to flush as part of the log force. The CIL flush is blocked doing
shutdown processing of all it's objects, trying to unpin a buffer
item. That requires taking the buffer lock....

For the CIL to be doing shutdown processing, the log must be marked
with XLOG_IO_ERROR, but that doesn't happen until after the log
force is issued. Hence for xfs_do_force_shutdown() to be forcing
the log on a shut down log, we must have had a racing
xlog_force_shutdown and xfs_force_shutdown like so:

p0			p1			CIL push

   			<holds buffer lock>
xlog_force_shutdown
  xfs_log_force
   test_and_set_bit(XLOG_IO_ERROR)
   						xlog_state_release_iclog()
						  sees XLOG_IO_ERROR
						  xlog_state_shutdown_callbacks
						    ....
						    xfs_buf_item_unpin
						    xfs_buf_lock
						    <blocks on buffer p1 holds>

   			xfs_force_shutdown
			  xfs_set_shutdown(mp) wins
			    xlog_force_shutdown
			      xfs_log_force
			        <blocks on CIL push>

  xfs_set_shutdown(mp) fails
  <shuts down rest of log>

The deadlock can be mitigated by avoiding the log force on the
second pass through xlog_force_shutdown. Do this by adding another
atomic state bit (XLOG_OP_PENDING_SHUTDOWN) that is set on entry to
xlog_force_shutdown() but doesn't mark the log as shutdown.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_log.c      | 11 +++++++++++
 fs/xfs/xfs_log_priv.h |  1 +
 2 files changed, 12 insertions(+)

diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
index 26b2f5887b88..05daad8a8d34 100644
--- a/fs/xfs/xfs_log.c
+++ b/fs/xfs/xfs_log.c
@@ -3455,6 +3455,16 @@ xlog_force_shutdown(
 	if (!log)
 		return false;
 
+	/*
+	 * Ensure that there is only ever one log shutdown being processed.
+	 * If we allow the log force below on a second pass after shutting
+	 * down the log, we risk deadlocking the CIL push as it may require
+	 * locks on objects the current shutdown context holds (e.g. taking
+	 * buffer locks to abort buffers on last unpin of buf log items).
+	 */
+	if (test_and_set_bit(XLOG_SHUTDOWN_STARTED, &log->l_opstate))
+		return false;
+
 	/*
 	 * Flush all the completed transactions to disk before marking the log
 	 * being shut down. We need to do this first as shutting down the log
@@ -3487,6 +3497,7 @@ xlog_force_shutdown(
 	spin_lock(&log->l_icloglock);
 	if (test_and_set_bit(XLOG_IO_ERROR, &log->l_opstate)) {
 		spin_unlock(&log->l_icloglock);
+		ASSERT(0);
 		return false;
 	}
 	spin_unlock(&log->l_icloglock);
diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
index b8778a4fd6b6..f3d78869e5e5 100644
--- a/fs/xfs/xfs_log_priv.h
+++ b/fs/xfs/xfs_log_priv.h
@@ -458,6 +458,7 @@ struct xlog {
 #define XLOG_IO_ERROR		2	/* log hit an I/O error, and being
 				   shutdown */
 #define XLOG_TAIL_WARN		3	/* log tail verify warning issued */
+#define XLOG_SHUTDOWN_STARTED	4	/* xlog_force_shutdown() exclusion */
 
 static inline bool
 xlog_recovery_needed(struct xlog *log)
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] xfs: prevent mount and log shutdown race
  2024-11-12 22:05 ` [PATCH 3/3] xfs: prevent mount and log shutdown race Dave Chinner
@ 2024-11-12 23:58   ` Darrick J. Wong
  2024-11-13  0:56     ` Dave Chinner
  2024-11-13  8:50   ` Christoph Hellwig
  1 sibling, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2024-11-12 23:58 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, cem

On Wed, Nov 13, 2024 at 09:05:16AM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> I recently had an fstests hang where there were two internal tasks
> stuck like so:
> 
> [ 6559.010870] task:kworker/24:45   state:D stack:12152 pid:631308 tgid:631308 ppid:2      flags:0x00004000
> [ 6559.016984] Workqueue: xfs-buf/dm-2 xfs_buf_ioend_work
> [ 6559.020349] Call Trace:
> [ 6559.022002]  <TASK>
> [ 6559.023426]  __schedule+0x650/0xb10
> [ 6559.025734]  schedule+0x6d/0xf0
> [ 6559.027835]  schedule_timeout+0x31/0x180
> [ 6559.030582]  wait_for_common+0x10c/0x1e0
> [ 6559.033495]  wait_for_completion+0x1d/0x30
> [ 6559.036463]  __flush_workqueue+0xeb/0x490
> [ 6559.039479]  ? mempool_alloc_slab+0x15/0x20
> [ 6559.042537]  xlog_cil_force_seq+0xa1/0x2f0
> [ 6559.045498]  ? bio_alloc_bioset+0x1d8/0x510
> [ 6559.048578]  ? submit_bio_noacct+0x2f2/0x380
> [ 6559.051665]  ? xlog_force_shutdown+0x3b/0x170
> [ 6559.054819]  xfs_log_force+0x77/0x230
> [ 6559.057455]  xlog_force_shutdown+0x3b/0x170
> [ 6559.060507]  xfs_do_force_shutdown+0xd4/0x200
> [ 6559.063798]  ? xfs_buf_rele+0x1bd/0x580
> [ 6559.066541]  xfs_buf_ioend_handle_error+0x163/0x2e0
> [ 6559.070099]  xfs_buf_ioend+0x61/0x200
> [ 6559.072728]  xfs_buf_ioend_work+0x15/0x20
> [ 6559.075706]  process_scheduled_works+0x1d4/0x400
> [ 6559.078814]  worker_thread+0x234/0x2e0
> [ 6559.081300]  kthread+0x147/0x170
> [ 6559.083462]  ? __pfx_worker_thread+0x10/0x10
> [ 6559.086295]  ? __pfx_kthread+0x10/0x10
> [ 6559.088771]  ret_from_fork+0x3e/0x50
> [ 6559.091153]  ? __pfx_kthread+0x10/0x10
> [ 6559.093624]  ret_from_fork_asm+0x1a/0x30
> [ 6559.096227]  </TASK>
> 
> [ 6559.109304] Workqueue: xfs-cil/dm-2 xlog_cil_push_work
> [ 6559.112673] Call Trace:
> [ 6559.114333]  <TASK>
> [ 6559.115760]  __schedule+0x650/0xb10
> [ 6559.118084]  schedule+0x6d/0xf0
> [ 6559.120175]  schedule_timeout+0x31/0x180
> [ 6559.122776]  ? call_rcu+0xee/0x2f0
> [ 6559.125034]  __down_common+0xbe/0x1f0
> [ 6559.127470]  __down+0x1d/0x30
> [ 6559.129458]  down+0x48/0x50
> [ 6559.131343]  ? xfs_buf_item_unpin+0x8d/0x380
> [ 6559.134213]  xfs_buf_lock+0x3d/0xe0
> [ 6559.136544]  xfs_buf_item_unpin+0x8d/0x380
> [ 6559.139253]  xlog_cil_committed+0x287/0x520
> [ 6559.142019]  ? sched_clock+0x10/0x30
> [ 6559.144384]  ? sched_clock_cpu+0x10/0x190
> [ 6559.147039]  ? psi_group_change+0x48/0x310
> [ 6559.149735]  ? _raw_spin_unlock+0xe/0x30
> [ 6559.152340]  ? finish_task_switch+0xbc/0x310
> [ 6559.155163]  xlog_cil_process_committed+0x6d/0x90
> [ 6559.158265]  xlog_state_shutdown_callbacks+0x53/0x110
> [ 6559.161564]  ? xlog_cil_push_work+0xa70/0xaf0
> [ 6559.164441]  xlog_state_release_iclog+0xba/0x1b0
> [ 6559.167483]  xlog_cil_push_work+0xa70/0xaf0
> [ 6559.170260]  process_scheduled_works+0x1d4/0x400
> [ 6559.173286]  worker_thread+0x234/0x2e0
> [ 6559.175779]  kthread+0x147/0x170
> [ 6559.177933]  ? __pfx_worker_thread+0x10/0x10
> [ 6559.180748]  ? __pfx_kthread+0x10/0x10
> [ 6559.183231]  ret_from_fork+0x3e/0x50
> [ 6559.185601]  ? __pfx_kthread+0x10/0x10
> [ 6559.188092]  ret_from_fork_asm+0x1a/0x30
> [ 6559.190692]  </TASK>
> 
> This is an ABBA deadlock where buffer IO completion is triggering a
> forced shutdown with the buffer lock held. It is waiting for the CIL
> to flush as part of the log force. The CIL flush is blocked doing
> shutdown processing of all it's objects, trying to unpin a buffer
> item. That requires taking the buffer lock....
> 
> For the CIL to be doing shutdown processing, the log must be marked
> with XLOG_IO_ERROR, but that doesn't happen until after the log
> force is issued. Hence for xfs_do_force_shutdown() to be forcing
> the log on a shut down log, we must have had a racing
> xlog_force_shutdown and xfs_force_shutdown like so:
> 
> p0			p1			CIL push
> 
>    			<holds buffer lock>
> xlog_force_shutdown
>   xfs_log_force
>    test_and_set_bit(XLOG_IO_ERROR)
>    						xlog_state_release_iclog()
> 						  sees XLOG_IO_ERROR
> 						  xlog_state_shutdown_callbacks
> 						    ....
> 						    xfs_buf_item_unpin
> 						    xfs_buf_lock
> 						    <blocks on buffer p1 holds>
> 
>    			xfs_force_shutdown
> 			  xfs_set_shutdown(mp) wins
> 			    xlog_force_shutdown
> 			      xfs_log_force
> 			        <blocks on CIL push>
> 
>   xfs_set_shutdown(mp) fails
>   <shuts down rest of log>

Huh.  I wonder, what happens today if there are multiple threads all
trying to shut down the log?  Same thing?  I guess we've never really
gone Farmer Brown's Bad Day[1] on this part of the fs.

Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

[1] https://www.gocomics.com/calvinandhobbes/1993/04/11

> The deadlock can be mitigated by avoiding the log force on the
> second pass through xlog_force_shutdown. Do this by adding another
> atomic state bit (XLOG_OP_PENDING_SHUTDOWN) that is set on entry to
> xlog_force_shutdown() but doesn't mark the log as shutdown.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/xfs/xfs_log.c      | 11 +++++++++++
>  fs/xfs/xfs_log_priv.h |  1 +
>  2 files changed, 12 insertions(+)
> 
> diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c
> index 26b2f5887b88..05daad8a8d34 100644
> --- a/fs/xfs/xfs_log.c
> +++ b/fs/xfs/xfs_log.c
> @@ -3455,6 +3455,16 @@ xlog_force_shutdown(
>  	if (!log)
>  		return false;
>  
> +	/*
> +	 * Ensure that there is only ever one log shutdown being processed.
> +	 * If we allow the log force below on a second pass after shutting
> +	 * down the log, we risk deadlocking the CIL push as it may require
> +	 * locks on objects the current shutdown context holds (e.g. taking
> +	 * buffer locks to abort buffers on last unpin of buf log items).
> +	 */
> +	if (test_and_set_bit(XLOG_SHUTDOWN_STARTED, &log->l_opstate))
> +		return false;
> +
>  	/*
>  	 * Flush all the completed transactions to disk before marking the log
>  	 * being shut down. We need to do this first as shutting down the log
> @@ -3487,6 +3497,7 @@ xlog_force_shutdown(
>  	spin_lock(&log->l_icloglock);
>  	if (test_and_set_bit(XLOG_IO_ERROR, &log->l_opstate)) {
>  		spin_unlock(&log->l_icloglock);
> +		ASSERT(0);
>  		return false;
>  	}
>  	spin_unlock(&log->l_icloglock);
> diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h
> index b8778a4fd6b6..f3d78869e5e5 100644
> --- a/fs/xfs/xfs_log_priv.h
> +++ b/fs/xfs/xfs_log_priv.h
> @@ -458,6 +458,7 @@ struct xlog {
>  #define XLOG_IO_ERROR		2	/* log hit an I/O error, and being
>  				   shutdown */
>  #define XLOG_TAIL_WARN		3	/* log tail verify warning issued */
> +#define XLOG_SHUTDOWN_STARTED	4	/* xlog_force_shutdown() exclusion */
>  
>  static inline bool
>  xlog_recovery_needed(struct xlog *log)
> -- 
> 2.45.2
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] xfs: prevent mount and log shutdown race
  2024-11-12 23:58   ` Darrick J. Wong
@ 2024-11-13  0:56     ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2024-11-13  0:56 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, cem

On Tue, Nov 12, 2024 at 03:58:08PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 13, 2024 at 09:05:16AM +1100, Dave Chinner wrote:
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > I recently had an fstests hang where there were two internal tasks
> > stuck like so:
....
> > For the CIL to be doing shutdown processing, the log must be marked
> > with XLOG_IO_ERROR, but that doesn't happen until after the log
> > force is issued. Hence for xfs_do_force_shutdown() to be forcing
> > the log on a shut down log, we must have had a racing
> > xlog_force_shutdown and xfs_force_shutdown like so:
> > 
> > p0			p1			CIL push
> > 
> >    			<holds buffer lock>
> > xlog_force_shutdown
> >   xfs_log_force
> >    test_and_set_bit(XLOG_IO_ERROR)
> >    						xlog_state_release_iclog()
> > 						  sees XLOG_IO_ERROR
> > 						  xlog_state_shutdown_callbacks
> > 						    ....
> > 						    xfs_buf_item_unpin
> > 						    xfs_buf_lock
> > 						    <blocks on buffer p1 holds>
> > 
> >    			xfs_force_shutdown
> > 			  xfs_set_shutdown(mp) wins
> > 			    xlog_force_shutdown
> > 			      xfs_log_force
> > 			        <blocks on CIL push>
> > 
> >   xfs_set_shutdown(mp) fails
> >   <shuts down rest of log>
> 
> Huh.  I wonder, what happens today if there are multiple threads all
> trying to shut down the log?  Same thing?

Yes. Anywhere that a both a log shutdown and a mount shutdown can be
called concurrently and one of them holds a locked buffer that is
also dirty in the CIL can trip over this. When I first saw it I
thought "calling shutdown with a locked buffer is bad", then
realised that we do that -everywhere- and assume it is safe to do
so. That's when I started looking deeper and found this....

> I guess we've never really
> gone Farmer Brown's Bad Day[1] on this part of the fs.

Oh, running ~64 individual fstests concurrently on the same VM does
a good imitation of that.

$ time sudo ./check-parallel /mnt/xfs -s xfs -x dump
Tests run: 1143
Failure count: 11

real    9m12.307s
user    0m0.007s
sys     0m0.013s
$

That's what's finding these little weird timing-related issues. I've got
several other repeating issues that I haven't got to the bottom of
yet, so Farmer Brown's Bad Day is not over yet...

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] xfs: prevent mount and log shutdown race
  2024-11-12 22:05 ` [PATCH 3/3] xfs: prevent mount and log shutdown race Dave Chinner
  2024-11-12 23:58   ` Darrick J. Wong
@ 2024-11-13  8:50   ` Christoph Hellwig
  1 sibling, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-11-13  8:50 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, cem

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] xfs: miscellaneous bug fixes
  2024-11-12 22:05 [PATCH 0/3] xfs: miscellaneous bug fixes Dave Chinner
                   ` (2 preceding siblings ...)
  2024-11-12 22:05 ` [PATCH 3/3] xfs: prevent mount and log shutdown race Dave Chinner
@ 2024-11-12 23:59 ` Darrick J. Wong
  2024-11-13  1:09   ` Dave Chinner
  2024-11-25 11:57 ` Carlos Maiolino
  4 siblings, 1 reply; 17+ messages in thread
From: Darrick J. Wong @ 2024-11-12 23:59 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs, cem

On Wed, Nov 13, 2024 at 09:05:13AM +1100, Dave Chinner wrote:
> These are three bug fixes for recent issues.
> 
> The first is a repost of the original patch to prevent allocation of
> sparse inode clusters at the end of an unaligned runt AG. There
> was plenty of discussion over that fix here:
> 
> https://lore.kernel.org/linux-xfs/20241024025142.4082218-1-david@fromorbit.com/
> 
> And the outcome of that discussion is that we can't allow sparse
> inode clusters overlapping the end of the runt AG without an on disk
> format definition change. Hence this patch to ensure the check is
> done correctly is the only change we need to make to the kernel to
> avoid this problem in the future.
> 
> Filesystems that have this problem on disk will need to run
> xfs_repair to remove the bad cluster, but no data loss is possible
> from this because the kernel currently disallows inode allocation
> from the bad cluster and so none of the inodes in the sparse cluster
> can actually be used. Hence there is no possible data loss or other
> metadata corruption possible from this situation, all we need to do
> is ensure that it doesn't happen again once repair has done it's
> work.

<shrug> How many systems are in this state?  Would those users rather we
fix the validation code in repair/scrub/wherever to allow ichunks that
overrun the end of a runt AG?

--D

> The other two patches are for issues I've recently hit when running
> lots of fstests in parallel. That changes loading and hence timing
> of events during tests, exposing latent race conditions in the code.
> The quota fix removes racy debug code that has been there since the
> quota code was first committed in 1996.
> 
> The log shutdown race fix is a much more recent issue created by
> trying to ensure shutdowns operate in a sane and predictable manner.
> The logic flaw is that we allow multiple log shutdowns to start and
> force the log before selecting on a single log shutdown task. This
> leads to a situation where shutdown log item callback processing
> gets stuck waiting on a task holding a buffer lock that is waiting
> on a log force that is waiting on shutdown log item callback
> processing to complete...
> 
> Thoughts?
> 
> 

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] xfs: miscellaneous bug fixes
  2024-11-12 23:59 ` [PATCH 0/3] xfs: miscellaneous bug fixes Darrick J. Wong
@ 2024-11-13  1:09   ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2024-11-13  1:09 UTC (permalink / raw)
  To: Darrick J. Wong; +Cc: linux-xfs, cem

On Tue, Nov 12, 2024 at 03:59:46PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 13, 2024 at 09:05:13AM +1100, Dave Chinner wrote:
> > These are three bug fixes for recent issues.
> > 
> > The first is a repost of the original patch to prevent allocation of
> > sparse inode clusters at the end of an unaligned runt AG. There
> > was plenty of discussion over that fix here:
> > 
> > https://lore.kernel.org/linux-xfs/20241024025142.4082218-1-david@fromorbit.com/
> > 
> > And the outcome of that discussion is that we can't allow sparse
> > inode clusters overlapping the end of the runt AG without an on disk
> > format definition change. Hence this patch to ensure the check is
> > done correctly is the only change we need to make to the kernel to
> > avoid this problem in the future.
> > 
> > Filesystems that have this problem on disk will need to run
> > xfs_repair to remove the bad cluster, but no data loss is possible
> > from this because the kernel currently disallows inode allocation
> > from the bad cluster and so none of the inodes in the sparse cluster
> > can actually be used. Hence there is no possible data loss or other
> > metadata corruption possible from this situation, all we need to do
> > is ensure that it doesn't happen again once repair has done it's
> > work.
> 
> <shrug> How many systems are in this state?

Some. Maybe many. Unfortunately the number is largely
unquantifiable. However, when it happens it dumps corruption reports
dumped in the log, so I'd say that there aren't that many of them
out there because we aren't getting swamped with corruption reports.

> Would those users rather we
> fix the validation code in repair/scrub/wherever to allow ichunks that
> overrun the end of a runt AG?

Uh, the previous discussion ended at "considering inode chunks
overlapping the end of the runt AG as valid requires an incompat
feature flag as older kernels cannot access inodes in that
location". i.e. older kernels will flag those inodes as corrupt if
we don't add an incompat feature flag to indicate they are valid.

At that point, we have a situation where they are forced to upgrade
userspace tools to do anything with the filesytsem that the kernel
added the new incompat feature flag for on upgrade.

That's a much worse situation, because they might not realise they
need to upgrade all the userspace tools and disaster recovery
utilities to handle this new format that the kernel upgrade
introduced....

The repair/scrub/whatever code already detects and fix the issue by
removing the bad cluster from the runt AG. We just need to stop the
kernel creating the bad clusters again.

IOWs, it just simpler for everyone to fix the bug like this and
continue to consider the sparse inode cluster at the end of the AG
is invalid.

Alternatively, if users can grow the block device, then they can
simply round up the size of the block device to a whole inode
chunk. They don't need to run repair to fix the issue; the cluster
is now valid because a whole chunk will fit at end of the runt AG.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] xfs: miscellaneous bug fixes
  2024-11-12 22:05 [PATCH 0/3] xfs: miscellaneous bug fixes Dave Chinner
                   ` (3 preceding siblings ...)
  2024-11-12 23:59 ` [PATCH 0/3] xfs: miscellaneous bug fixes Darrick J. Wong
@ 2024-11-25 11:57 ` Carlos Maiolino
  4 siblings, 0 replies; 17+ messages in thread
From: Carlos Maiolino @ 2024-11-25 11:57 UTC (permalink / raw)
  To: linux-xfs, Dave Chinner

On Wed, 13 Nov 2024 09:05:13 +1100, Dave Chinner wrote:
> These are three bug fixes for recent issues.
> 
> The first is a repost of the original patch to prevent allocation of
> sparse inode clusters at the end of an unaligned runt AG. There
> was plenty of discussion over that fix here:
> 
> https://lore.kernel.org/linux-xfs/20241024025142.4082218-1-david@fromorbit.com/
> 
> [...]

Applied to for-next, thanks!

[1/3] xfs: fix sparse inode limits on runt AG
      commit: 13325333582d4820d39b9e8f63d6a54e745585d9
[2/3] xfs: delalloc and quota softlimit timers are incoherent
      commit: c9c293240e4351aa2678186cd88a08141fc6ce9e
[3/3] xfs: prevent mount and log shutdown race
      commit: a8581099604dfa609a34a3fac8ef5af0d300d2c1

Best regards,
-- 
Carlos Maiolino <cem@kernel.org>


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 0/3] xfs: sparse inodes overlap end of filesystem
@ 2024-10-24  2:51 Dave Chinner
  2024-10-24  2:51 ` [PATCH 1/3] xfs: fix sparse inode limits on runt AG Dave Chinner
  0 siblings, 1 reply; 17+ messages in thread
From: Dave Chinner @ 2024-10-24  2:51 UTC (permalink / raw)
  To: linux-xfs

We have had a large number of recent reports about cloud filesystems
with "corrupt" inode records recently. They are all the same, and
feature a filesystem that has been grown from a small size to a
larger size (to 30G or 50G). In all cases, they have a very small
runt AG at the end of the filesystem.  In the case of the 30GB
filesystems, this is 1031 blocks long.

These filesystems start issuing corruption warnings when trying to
allocate an in a sparse cluster at block 1024 of the runt AG. At
this point, there shouldn't be a sparse inode cluster because there
isn't space to fit an entire inode chunk (8 blocks) at block 1024.
i.e. it is only 7 blocks from the end of the AG.

Hence the first bug is that we allowed allocation of a sparse inode
cluster in this location when it should not have occurred. The first
patch in the series addresses this.

However, there is actually nothing corrupt in the on-disk sparse
inode record or inode cluster at agbno 1024. It is a 32 inode
cluster, which means it is 4 blocks in length, so sits entirely
within the AG and every inode in the record is addressable and
accessible. The only thing we can't do is make the sparse inode
record whole - the inode allocation code cannot allocate another 4
blocks that span beyond the end of the AG. Hence this inode record
and cluster remain sparse until all the inodes in it are freed and
the cluster removed from disk.

The second bug is that we don't consider inodes beyond inode cluster
alignment at the end of an AG as being valid. When sparse inode
alignment is in use, we set the in-memory inode cluster alignment to
match the inode chunk alignment, and so the maximum valid inode
number is inode chunk aligned, not inode cluster aligned. Hence when
we have an inode cluster at the end of the AG - so the max inode
number is cluster aligned - we reject that entire cluster as being
invalid. 

As stated above, there is nothing corrupt about the sparse inode
cluster at the end of the AG, it just doesn't match an arbitrary
alignment validation restriction for inodes at the end of the AG.
Given we have production filesystems out there with sparse inode
clusters allocated with cluster alignment at the end of the AG, we
need to consider these inodes as valid and not error out with a
corruption report.  The second patch addresses this.

The third issue I found is that we never validate the
sb->sb_spino_align valid when we mount the filesystem. It could have
any value and we just blindly use it when calculating inode
allocation geometry. The third patch adds sb->sb_spino_align range
validation to the superblock verifier.

There is one question that needs to be resolved in this patchset: if
we take patch 2 to allow sparse inodes at the end of the AG, why
would we need the change in patch 1? Indeed, at this point I have to
ask why we even need the min/max agbno guidelines to the inode chunk
allocation as we end up allowing any aligned location in the AG to
be used by sparse inodes. i.e. if we take patch 2, then patch 1 is
unnecessary and now we can remove a bunch of code (min/max_agbno
constraints) from the allocator paths...

I'd prefer that we take the latter path: ignore the first patch.
This results in more flexible behaviour, allows existing filesystems
with this issue to work without needing xfs_repair to fix them, and
we get to remove complexity from the code.

Thoughts?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/3] xfs: fix sparse inode limits on runt AG
  2024-10-24  2:51 [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Dave Chinner
@ 2024-10-24  2:51 ` Dave Chinner
  0 siblings, 0 replies; 17+ messages in thread
From: Dave Chinner @ 2024-10-24  2:51 UTC (permalink / raw)
  To: linux-xfs

From: Dave Chinner <dchinner@redhat.com>

The runt AG at the end of a filesystem is almost always smaller than
the mp->m_sb.sb_agblocks. Unfortunately, when setting the max_agbno
limit for the inode chunk allocation, we do not take this into
account. This means we can allocate a sparse inode chunk that
overlaps beyond the end of an AG. When we go to allocate an inode
from that sparse chunk, the irec fails validation because the
agbno of the start of the irec is beyond valid limits for the runt
AG.

Prevent this from happening by taking into account the size of the
runt AG when allocating inode chunks. Also convert the various
checks for valid inode chunk agbnos to use xfs_ag_block_count()
so that they will also catch such issues in the future.

Fixes: 56d1115c9bc7 ("xfs: allocate sparse inode chunks on full chunk allocation failure")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/libxfs/xfs_ialloc.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c
index 271855227514..6258527315f2 100644
--- a/fs/xfs/libxfs/xfs_ialloc.c
+++ b/fs/xfs/libxfs/xfs_ialloc.c
@@ -855,7 +855,8 @@ xfs_ialloc_ag_alloc(
 		 * the end of the AG.
 		 */
 		args.min_agbno = args.mp->m_sb.sb_inoalignmt;
-		args.max_agbno = round_down(args.mp->m_sb.sb_agblocks,
+		args.max_agbno = round_down(xfs_ag_block_count(args.mp,
+							pag->pag_agno),
 					    args.mp->m_sb.sb_inoalignmt) -
 				 igeo->ialloc_blks;
 
@@ -2332,9 +2333,9 @@ xfs_difree(
 		return -EINVAL;
 	}
 	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
-	if (agbno >= mp->m_sb.sb_agblocks)  {
-		xfs_warn(mp, "%s: agbno >= mp->m_sb.sb_agblocks (%d >= %d).",
-			__func__, agbno, mp->m_sb.sb_agblocks);
+	if (agbno >= xfs_ag_block_count(mp, pag->pag_agno)) {
+		xfs_warn(mp, "%s: agbno >= xfs_ag_block_count (%d >= %d).",
+			__func__, agbno, xfs_ag_block_count(mp, pag->pag_agno));
 		ASSERT(0);
 		return -EINVAL;
 	}
@@ -2457,7 +2458,7 @@ xfs_imap(
 	 */
 	agino = XFS_INO_TO_AGINO(mp, ino);
 	agbno = XFS_AGINO_TO_AGBNO(mp, agino);
-	if (agbno >= mp->m_sb.sb_agblocks ||
+	if (agbno >= xfs_ag_block_count(mp, pag->pag_agno) ||
 	    ino != XFS_AGINO_TO_INO(mp, pag->pag_agno, agino)) {
 		error = -EINVAL;
 #ifdef DEBUG
@@ -2467,11 +2468,12 @@ xfs_imap(
 		 */
 		if (flags & XFS_IGET_UNTRUSTED)
 			return error;
-		if (agbno >= mp->m_sb.sb_agblocks) {
+		if (agbno >= xfs_ag_block_count(mp, pag->pag_agno)) {
 			xfs_alert(mp,
 		"%s: agbno (0x%llx) >= mp->m_sb.sb_agblocks (0x%lx)",
 				__func__, (unsigned long long)agbno,
-				(unsigned long)mp->m_sb.sb_agblocks);
+				(unsigned long)xfs_ag_block_count(mp,
+							pag->pag_agno));
 		}
 		if (ino != XFS_AGINO_TO_INO(mp, pag->pag_agno, agino)) {
 			xfs_alert(mp,
-- 
2.45.2


^ permalink raw reply related	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2024-11-25 11:57 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-11-12 22:05 [PATCH 0/3] xfs: miscellaneous bug fixes Dave Chinner
2024-11-12 22:05 ` [PATCH 1/3] xfs: fix sparse inode limits on runt AG Dave Chinner
2024-11-12 23:15   ` Darrick J. Wong
2024-11-13  0:12     ` Dave Chinner
2024-11-13  0:24       ` Darrick J. Wong
2024-11-12 22:05 ` [PATCH 2/3] xfs: delalloc and quota softlimit timers are incoherent Dave Chinner
2024-11-12 23:48   ` Darrick J. Wong
2024-11-13  0:14     ` Dave Chinner
2024-11-13  8:48   ` Christoph Hellwig
2024-11-12 22:05 ` [PATCH 3/3] xfs: prevent mount and log shutdown race Dave Chinner
2024-11-12 23:58   ` Darrick J. Wong
2024-11-13  0:56     ` Dave Chinner
2024-11-13  8:50   ` Christoph Hellwig
2024-11-12 23:59 ` [PATCH 0/3] xfs: miscellaneous bug fixes Darrick J. Wong
2024-11-13  1:09   ` Dave Chinner
2024-11-25 11:57 ` Carlos Maiolino
  -- strict thread matches above, loose matches on Subject: below --
2024-10-24  2:51 [PATCH 0/3] xfs: sparse inodes overlap end of filesystem Dave Chinner
2024-10-24  2:51 ` [PATCH 1/3] xfs: fix sparse inode limits on runt AG Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox