From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Long Li <leo.lilong@huawei.com>,
Dave Chinner <dchinner@redhat.com>,
Chandan Babu R <chandan.babu@oracle.com>,
Eric Sandeen <sandeen@redhat.com>,
Bill O'Donnell <billodo@redhat.com>,
linux-xfs@vger.kernel.org, yi.zhang@huawei.com,
houtao1@huawei.com, guoxuenan@huawei.com
Subject: Re: [PATCH v1] xfs: fix sb write verify for lazysbcount
Date: Sun, 23 Oct 2022 21:07:42 -0700 [thread overview]
Message-ID: <Y1YPjkiiN3FyMBfG@magnolia> (raw)
In-Reply-To: <20221022211613.GW3600936@dread.disaster.area>
On Sun, Oct 23, 2022 at 08:16:13AM +1100, Dave Chinner wrote:
> On Sat, Oct 22, 2022 at 08:01:25PM +0800, Long Li wrote:
> > On Fri, Oct 21, 2022 at 07:14:28PM -0700, Darrick J. Wong wrote:
> > > On Sat, Oct 22, 2022 at 10:03:45AM +0800, Long Li wrote:
> > > > When lazysbcount is enabled, multiple threads stress test the xfs report
> > > > the following problems:
>
> We've had lazy sb counters for 15 years and just about every XFS
> filesystem in production uses them, so providing us with some idea
> of the scope of the problem and how to reproduce it would be greatly
> appreciated.
>
> What stress test are you running? What filesystem config does it
> manifest on (other than lazysbcount=1)? How long does the stress
> test run for, and where/why does log recovery get run in this stress
> test?
>
> > > > XFS (loop0): SB summary counter sanity check failed
> > > > XFS (loop0): Metadata corruption detected at xfs_sb_write_verify
> > > > +0x13b/0x460, xfs_sb block 0x0
> > > > XFS (loop0): Unmount and run xfs_repair
> > > > XFS (loop0): First 128 bytes of corrupted metadata buffer:
> > > > 00000000: 58 46 53 42 00 00 10 00 00 00 00 00 00 28 00 00 XFSB.........(..
> > > > 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
> > > > 00000020: 69 fb 7c cd 5f dc 44 af 85 74 e0 cc d4 e3 34 5a i.|._.D..t....4Z
> > > > 00000030: 00 00 00 00 00 20 00 06 00 00 00 00 00 00 00 80 ..... ..........
> > > > 00000040: 00 00 00 00 00 00 00 81 00 00 00 00 00 00 00 82 ................
> > > > 00000050: 00 00 00 01 00 0a 00 00 00 00 00 04 00 00 00 00 ................
> > > > 00000060: 00 00 0a 00 b4 b5 02 00 02 00 00 08 00 00 00 00 ................
> > > > 00000070: 00 00 00 00 00 00 00 00 0c 09 09 03 14 00 00 19 ................
> > > > XFS (loop0): Corruption of in-memory data (0x8) detected at _xfs_buf_ioapply
> > > > +0xe1e/0x10e0 (fs/xfs/xfs_buf.c:1580). Shutting down filesystem.
> > > > XFS (loop0): Please unmount the filesystem and rectify the problem(s)
> > > > XFS (loop0): log mount/recovery failed: error -117
> > > > XFS (loop0): log mount failed
> > > >
> > > > The cause of the problem is that during the log recovery process, incorrect
> > > > icount and ifree are recovered from the log and fail to pass the size check
> > >
> > > Are you saying that the log contained a transaction in which ifree >
> > > icount?
> >
> > Yes, this situation is possible. For example consider the following sequence:
> >
> > CPU0 CPU1
> > xfs_log_sb xfs_trans_unreserve_and_mod_sb
> > ---------- ------------------------------
> > percpu_counter_sum(&mp->m_icount)
> > percpu_counter_add(&mp->m_icount, idelta)
> > percpu_counter_add_batch(&mp->m_icount,
> > idelta, XFS_ICOUNT_BATCH)
> > percpu_counter_sum(&mp->m_ifree)
>
> What caused the xfs_log_sb() to be called? Very few things
> actually log the superblock this way at runtime - it's generally
> only logged directly like this when a feature bit changes during a
> transaction (rare) or at a synchronisation point when everything
> else is idle and there's no chance of a race like this occurring...
>
> I can see a couple of routes to this occurring via feature bit
> modification, but I don't see them being easy to hit or something
> that would exist for very long in the journal. Hence I'm wondering
> if there should be runtime protection for xfs_log_sb() to avoid
> these problems....
Maybe. Or perhaps we sample m_i{count,free} until they come up with a
value that will pass the verifier, and only then log the new values to
the primary super xfs_buf?
> > > > in xfs_validate_sb_write().
> > > >
> > > > With lazysbcount is enabled, There is no additional lock protection for
> > > > reading m_ifree and m_icount in xfs_log_sb(), if other threads modifies
> > > > the m_ifree between the read m_icount and the m_ifree, this will make the
> > > > m_ifree larger than m_icount and written to the log. If we have an unclean
> > > > shutdown, this will be corrected by xfs_initialize_perag_data() rebuilding
> > > > the counters from the AGF block counts, and the correction is later than
> > > > log recovery. During log recovery, incorrect ifree/icount may be restored
> > > > from the log and written to the super block, since ifree and icount have
> > > > not been corrected at this time, the relationship between ifree and icount
> > > > cannot be checked in xfs_validate_sb_write().
> > > >
> > > > So, don't check the size between ifree and icount in xfs_validate_sb_write()
> > > > when lazysbcount is enabled.
> > >
> > > Um, doesn't that neuter a sanity check on all V5 filesystems?
> >
> > Yes, such modifications don't look like the best way, all sb write checks
> > will be affect. Maybe it can increase the judgment of clean mount and reduce
> > the scope of influence, but this requires setting the XFS_OPSTATE_CLEAN after
> > re-initialise incore superblock counters, like this:
>
> I'm not sure that silencing the warning and continuing log recovery
> with an invalid accounting state is the best way to proceed. We
> haven't yet recovered unlinked inodes at the time this warning
> fires. If ifree really is larger than icount, then we've got a
> real problem at this point in log recovery.
>
> Hence I suspect that we should be looking at fixing the runtime race
> condition that leads to the problem, not trying to work around
> inconsistency in the recovery code.
Agreed. Higher level code shouldn't be storing garbage ifree/icount to
the primary superblock xfs_buf, period.
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com
next prev parent reply other threads:[~2022-10-24 4:07 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-10-22 2:03 [PATCH v1] xfs: fix sb write verify for lazysbcount Long Li
2022-10-22 2:14 ` Darrick J. Wong
2022-10-22 12:01 ` Long Li
2022-10-22 21:16 ` Dave Chinner
2022-10-24 4:07 ` Darrick J. Wong [this message]
2022-10-24 5:43 ` Dave Chinner
2022-10-24 12:28 ` Long Li
2022-10-24 5:06 ` Long Li
2022-10-25 9:15 ` [PATCH v2] " Long Li
2022-10-25 18:16 ` Darrick J. Wong
2022-10-26 9:13 ` Long Li
2022-10-26 18:52 ` Darrick J. Wong
2022-10-27 13:25 ` Long Li
2022-10-27 16:05 ` Darrick J. Wong
2022-10-29 7:16 ` Long Li
2022-10-30 22:04 ` Dave Chinner
2022-10-31 14:17 ` Long Li
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Y1YPjkiiN3FyMBfG@magnolia \
--to=djwong@kernel.org \
--cc=billodo@redhat.com \
--cc=chandan.babu@oracle.com \
--cc=david@fromorbit.com \
--cc=dchinner@redhat.com \
--cc=guoxuenan@huawei.com \
--cc=houtao1@huawei.com \
--cc=leo.lilong@huawei.com \
--cc=linux-xfs@vger.kernel.org \
--cc=sandeen@redhat.com \
--cc=yi.zhang@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox