Re: [PATCH v1] xfs: fix sb write verify for lazysbcount

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Long Li <leo.lilong@huawei.com>
To: Dave Chinner <david@fromorbit.com>
Cc: "Darrick J. Wong" <djwong@kernel.org>,
	Dave Chinner <dchinner@redhat.com>,
	Chandan Babu R <chandan.babu@oracle.com>,
	Eric Sandeen <sandeen@redhat.com>,
	Bill O'Donnell <billodo@redhat.com>, <linux-xfs@vger.kernel.org>,
	<yi.zhang@huawei.com>, <houtao1@huawei.com>,
	<guoxuenan@huawei.com>
Subject: Re: [PATCH v1] xfs: fix sb write verify for lazysbcount
Date: Mon, 24 Oct 2022 13:06:59 +0800	[thread overview]
Message-ID: <20221024050659.GA338324@ceph-admin> (raw)
In-Reply-To: <20221022211613.GW3600936@dread.disaster.area>

On Sun, Oct 23, 2022 at 08:16:13AM +1100, Dave Chinner wrote:
> On Sat, Oct 22, 2022 at 08:01:25PM +0800, Long Li wrote:
> > On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> > > On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > > > When lazysbcount is enabled, multiple threads stress test the xfs report
> > > > the following problems:
> 
> We've had lazy sb counters for 15 years and just about every XFS
> filesystem in production uses them, so providing us with some idea
> of the scope of the problem and how to reproduce it would be greatly
> appreciated.

Ok, here is my simplified script that could reproducer the bug:

#!/bin/bash

device=/dev/sda
testdir=/mnt/test
round=0

function fail()
{
	echo "$*"
	exit 1
}

while [ $round -lt 10000 ]
do
	echo "******* round $round ********"
	mkfs.xfs -f $device
	mkdir -p $testdir
	mount $device $testdir || fail "mount failed!"
	fsstress -d $testdir -l 0 -n 10000 -p 4 >/dev/null &
	sleep 4
	killall -w fsstress
	umount $testdir
	xfs_repair -e $device > /dev/null
	if [ $? -eq 2 ];then
		echo "ERR CODE 2: Dirty log exception encountered during repair."
		exit 1
	fi
	sleep 1
	round=$(($round+1))
done

> 
> What stress test are you running? What filesystem config does it
> manifest on (other than lazysbcount=1)?  How long does the stress
> test run for, and where/why does log recovery get run in this stress
> test?

I tested with the fsstress tool and used the default mount options. 
More configuration are as follows:

CONFIG_XFS_FS=y
CONFIG_XFS_SUPPORT_V4=y
CONFIG_XFS_QUOTA=y
CONFIG_XFS_POSIX_ACL=y
CONFIG_XFS_RT=y
CONFIG_XFS_DEBUG=y
CONFIG_XFS_ASSERT_FATAL=y

xfs_db> info
meta-data=/dev/sda               isize=512    agcount=4, agsize=65536 blks
         =                       sectsz=512   attr=2, projid32bit=1
	 =                       crc=1        finobt=1, sparse=1, rmapbt=0
	 =                       reflink=1
data     =                       bsize=4096   blocks=262144, imaxpct=25
	 =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
	 =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

With the above test script, it will take several days to reproduce the bug.
If we add a delay in xfs_log_sb(), this will quickly reproduce the problem:

 	if (xfs_has_lazysbcount(mp)) {
 		mp->m_sb.sb_icount = percpu_counter_sum(&mp->m_icount);
+		udelay(2000);
 		mp->m_sb.sb_ifree = percpu_counter_sum(&mp->m_ifree);
 		mp->m_sb.sb_fdblocks = percpu_counter_sum(&mp->m_fdblocks);
 	}

> 
> > > > XFS (loop0): SB summary counter sanity check failed
> > > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify
> > > > 	     +0x13b/0x460, xfs_sb block 0x0
> > > > XFS (loop0): Unmount and run xfs_repair
> > > > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00  XFSB.........(..
> > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> > > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a  i.|._.D..t....4Z
> > > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80  ..... ..........
> > > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82  ................
> > > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00  ................
> > > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00  ................
> > > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19  ................
> > > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> > > > 	+0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580).  Shutting down filesystem.
> > > > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > > > XFS (loop0): log mount/recovery failed: error -117
> > > > XFS (loop0): log mount failed
> > > > 
> > > > The cause of the problem is that during the log recovery process, incorrect
> > > > icount and ifree are recovered from the log and fail to pass the size check
> > > 
> > > Are you saying that the log contained a transaction in which ifree >
> > > icount?
> > 
> > Yes, this situation is possible. For example consider the following sequence:
> > 
> >  CPU0				    CPU1
> >  xfs_log_sb			    xfs_trans_unreserve_and_mod_sb
> >  ----------			    ------------------------------
> >  percpu_counter_sum(&mp->m_icount)
> > 				    percpu_counter_add(&mp->m_icount, idelta)
> > 				    percpu_counter_add_batch(&mp->m_icount,
> > 						  idelta, XFS_ICOUNT_BATCH)
> >  percpu_counter_sum(&mp->m_ifree)
> 
> What caused the xfs_log_sb() to be called? Very few things
> actually log the superblock this way at runtime - it's generally
> only logged directly like this when a feature bit changes during a
> transaction (rare) or at a synchronisation point when everything
> else is idle and there's no chance of a race like this occurring...
> 
> I can see a couple of routes to this occurring via feature bit
> modification, but I don't see them being easy to hit or something
> that would exist for very long in the journal. Hence I'm wondering
> if there should be runtime protection for xfs_log_sb() to avoid
> these problems....
> 
> > > > in xfs_validate_sb_write().
> > > > 
> > > > With lazysbcount is enabled, There is no additional lock protection for
> > > > reading m_ifree and m_icount in xfs_log_sb(), if other threads modifies
> > > > the m_ifree between the read m_icount and the m_ifree, this will make the
> > > > m_ifree larger than m_icount and written to the log. If we have an unclean
> > > > shutdown, this will be corrected by xfs_initialize_perag_data() rebuilding
> > > > the counters from the AGF block counts, and the correction is later than
> > > > log recovery. During log recovery, incorrect ifree/icount may be restored
> > > > from the log and written to the super block, since ifree and icount have
> > > > not been corrected at this time, the relationship between ifree and icount
> > > > cannot be checked in xfs_validate_sb_write().
> > > > 
> > > > So, don't check the size between ifree and icount in xfs_validate_sb_write()
> > > > when lazysbcount is enabled.
> > > 
> > > Um, doesn't that neuter a sanity check on all V5 filesystems?
> >
> > Yes, such modifications don't look like the best way, all sb write checks 
> > will be affect. Maybe it can increase the judgment of clean mount and reduce
> > the scope of influence, but this requires setting the XFS_OPSTATE_CLEAN after
> > re-initialise incore superblock counters, like this:
> 
> I'm not sure that silencing the warning and continuing log recovery
> with an invalid accounting state is the best way to proceed. We
> haven't yet recovered unlinked inodes at the time this warning
> fires. If ifree really is larger than icount, then we've got a
> real problem at this point in log recovery.
> 
> Hence I suspect that we should be looking at fixing the runtime race
> condition that leads to the problem, not trying to work around
> inconsistency in the recovery code.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

next prev parent reply	other threads:[~2022-10-24  4:45 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-10-22  2:03 [PATCH v1] xfs: fix sb write verify for lazysbcount Long Li
2022-10-22  2:14 ` Darrick J. Wong
2022-10-22 12:01   ` Long Li
2022-10-22 21:16     ` Dave Chinner
2022-10-24  4:07       ` Darrick J. Wong
2022-10-24  5:43         ` Dave Chinner
2022-10-24 12:28           ` Long Li
2022-10-24  5:06       ` Long Li [this message]
2022-10-25  9:15 ` [PATCH v2] " Long Li
2022-10-25 18:16   ` Darrick J. Wong
2022-10-26  9:13     ` Long Li
2022-10-26 18:52       ` Darrick J. Wong
2022-10-27 13:25         ` Long Li
2022-10-27 16:05           ` Darrick J. Wong
2022-10-29  7:16             ` Long Li
2022-10-30 22:04               ` Dave Chinner
2022-10-31 14:17                 ` Long Li

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20221024050659.GA338324@ceph-admin \
    --to=leo.lilong@huawei.com \
    --cc=billodo@redhat.com \
    --cc=chandan.babu@oracle.com \
    --cc=david@fromorbit.com \
    --cc=dchinner@redhat.com \
    --cc=djwong@kernel.org \
    --cc=guoxuenan@huawei.com \
    --cc=houtao1@huawei.com \
    --cc=linux-xfs@vger.kernel.org \
    --cc=sandeen@redhat.com \
    --cc=yi.zhang@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.