From: Brian Foster <bfoster@redhat.com>
To: Long Li <leo.lilong@huawei.com>
Cc: Dave Chinner <david@fromorbit.com>,
djwong@kernel.org, chandanbabu@kernel.org,
linux-xfs@vger.kernel.org, yi.zhang@huawei.com,
houtao1@huawei.com, yangerkun@huawei.com
Subject: Re: [PATCH] xfs: ensure submit buffers on LSN boundaries in error handlers
Date: Mon, 15 Jan 2024 09:14:30 -0500 [thread overview]
Message-ID: <ZaU9xmrLRREwQ9Aa@bfoster> (raw)
In-Reply-To: <20240115133103.GA1665392@ceph-admin>
On Mon, Jan 15, 2024 at 09:31:03PM +0800, Long Li wrote:
> On Fri, Jan 12, 2024 at 01:42:32PM -0500, Brian Foster wrote:
> > On Fri, Jan 12, 2024 at 08:55:47PM +0800, Long Li wrote:
> > > On Thu, Jan 11, 2024 at 11:47:47AM +1100, Dave Chinner wrote:
> > > > On Thu, Dec 28, 2023 at 08:46:46PM +0800, Long Li wrote:
> > > > > In order to make sure that submits buffers on lsn boundaries in the
> > > > > abnormal paths, we need to check error status before submit buffers that
> > > > > have been added from the last record processed. If error status exist,
> > > > > buffers in the bufffer_list should be canceled.
> > > > >
> > > > > Canceling the buffers in the buffer_list directly isn't correct, unlike
> > > > > any other place where write list was canceled, these buffers has been
> > > > > initialized by xfs_buf_item_init() during recovery and held by buf
> > > > > item, buf items will not be released in xfs_buf_delwri_cancel(). If
> > > > > these buffers are submitted successfully, buf items assocated with
> > > > > the buffer will be released in io end process. So releasing buf item
> > > > > in write list cacneling process is needed.
> > > >
> > > > I still don't think this is correct.
> > > >
> > > > > Fixes: 50d5c8d8e938 ("xfs: check LSN ordering for v5 superblocks during recovery")
> > > > > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > > > > ---
> > > > > fs/xfs/xfs_buf.c | 2 ++
> > > > > fs/xfs/xfs_log_recover.c | 22 +++++++++++++---------
> > > > > 2 files changed, 15 insertions(+), 9 deletions(-)
> > > > >
> > > > > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > > > > index 8e5bd50d29fe..6a1b26aaf97e 100644
> > > > > --- a/fs/xfs/xfs_buf.c
> > > > > +++ b/fs/xfs/xfs_buf.c
> > > > > @@ -2075,6 +2075,8 @@ xfs_buf_delwri_cancel(
> > > > > xfs_buf_lock(bp);
> > > > > bp->b_flags &= ~_XBF_DELWRI_Q;
> > > > > xfs_buf_list_del(bp);
> > > > > + if (bp->b_log_item)
> > > > > + xfs_buf_item_relse(bp);
> > > > > xfs_buf_relse(bp);
> > > >
> > > > I still don't think this is safe. The buffer log item might still be
> > > > tracked in the AIL when the delwri list is cancelled, so the delwri
> > > > list cancelling cannot release the BLI without removing the item
> > > > from the AIL, too. The delwri cancelling walk really shouldn't be
> > > > screwing with AIL state, which means it can't touch the BLIs here.
> > > >
> > > > At minimum, it's a landmine for future users of
> > > > xfs_buf_delwri_cancel(). A quick look at the quotacheck code
> > > > indicates that it can cancel delwri lists that have BLIs in the AIL
> > > > (for newly allocated dquot chunks), so I think this is a real concern.
> > > >
> > > > This is one of the reasons for submitting the delwri list on error;
> > > > the IO completion code does all the correct cleanup of log items
> > > > including removing them from the AIL because the buffer is now
> > > > either clean or stale and no longer needs to be tracked by the AIL.
> > >
> > > Yes, it's not a safety solution.
> > >
> > > >
> > > > If the filesystem has been shut down, then delwri list submission
> > > > will error out all buffers on the list via IO submission/completion
> > > > and do all the correct cleanup automatically.
> > > >
> > > > I note that write IO errors during log recovery will cause immediate
> > > > shutdown of the filesytsem via xfs_buf_ioend_handle_error():
> > > >
> > > > /*
> > > > * We're not going to bother about retrying this during recovery.
> > > > * One strike!
> > > > */
> > > > if (bp->b_flags & _XBF_LOGRECOVERY) {
> > > > xfs_force_shutdown(mp, SHUTDOWN_META_IO_ERROR);
> > > > return false;
> > > > }
> > > >
> > > > So I'm guessing that the IO error injection error that caused this
> > > > failure was on a buffer read part way through recovering items.
> > > >
> > > > Can you confirm that the failure is only seen after read IO error
> > > > injection and that write IO error injection causes immediate
> > > > shutdown and so avoids the problem altogether?
> > >
> > > This problem reproduce very hard, we reproduce it only three times.
> > > There may be several mounts between writing buffer not on LSN boundaries
> > > and reporting free space btree corruption, I can't distinguish the
> > > violation happend in which mount during test. So judging by the message
> > > I've reprodced, I can't confirm that the failure is only seen after read
> > > IO error injection. Look at one of the kernel message I've reprodced,
> > > there are several mount fails before reporting free space btree corruption,
> > > the reasons of mount fail include read IO error and write IO error.
> > >
> > > [51555.801349] XFS (dm-3): Mounting V5 Filesystem
> > > [51555.982130] XFS (dm-3): Starting recovery (logdev: internal)
> > > [51558.153638] FAULT_INJECTION: forcing a failure.
> > > name fail_make_request, interval 20, probability 1, space 0, times -1
> > > [51558.153723] XFS (dm-3): log recovery read I/O error at daddr 0x3972 len 1 error -5
> > > [51558.165996] XFS (dm-3): log mount/recovery failed: error -5
> > > [51558.166880] XFS (dm-3): log mount failed
> > > [51558.410963] XFS (dm-3): EXPERIMENTAL big timestamp feature in use. Use at your own risk!
> > > [51558.410981] XFS (dm-3): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
> > > [51558.413074] XFS (dm-3): Mounting V5 Filesystem
> > > [51558.595739] XFS (dm-3): Starting recovery (logdev: internal)
> > > [51559.592552] FAULT_INJECTION: forcing a failure.
> > > name fail_make_request, interval 20, probability 1, space 0, times -1
> > > [51559.593008] XFS (dm-3): metadata I/O error in "xfs_buf_ioend_handle_error+0x170/0x760 [xfs]" at daddr 0x1879e0 len 32 error 5
> > > [51559.593335] XFS (dm-3): Metadata I/O Error (0x1) detected at xfs_buf_ioend_handle_error+0x63c/0x760 [xfs] (fs/xfs/xfs_buf.c:1272). Shutting down filesystem.
> > > [51559.593346] XFS (dm-3): Please unmount the filesystem and rectify the problem(s)
> > > [51559.602833] XFS (dm-3): log mount/recovery failed: error -5
> > > [51559.603772] XFS (dm-3): log mount failed
> > > [51559.835690] XFS (dm-3): EXPERIMENTAL big timestamp feature in use. Use at your own risk!
> > > [51559.835708] XFS (dm-3): EXPERIMENTAL inode btree counters feature in use. Use at your own risk!
> > > [51559.837829] XFS (dm-3): Mounting V5 Filesystem
> > > [51560.024083] XFS (dm-3): Starting recovery (logdev: internal)
> > > [51562.155545] FAULT_INJECTION: forcing a failure.
> > > name fail_make_request, interval 20, probability 1, space 0, times -1
> > > [51562.445074] XFS (dm-3): Ending recovery (logdev: internal)
> > > [51563.553960] XFS (dm-3): Internal error ltbno + ltlen > bno at line 1976 of file fs/xfs/libxfs/xfs_alloc.c. Caller xfs_free_ag_extent+0x558/0xd80 [xfs]
> > > [51563.558629] XFS (dm-3): Corruption detected. Unmount and run xfs_repair
> > >
> > > >
> > > > If so, then all we need to do to handle instantiation side errors (EIO, ENOMEM,
> > > > etc) is this:
> > > >
> > > > /*
> > > > * Submit buffers that have been dirtied by the last record recovered.
> > > > */
> > > > if (!list_empty(&buffer_list)) {
> > > > if (error) {
> > > > /*
> > > > * If there has been an item recovery error then we
> > > > * cannot allow partial checkpoint writeback to
> > > > * occur. We might have multiple checkpoints with the
> > > > * same start LSN in this buffer list, and partial
> > > > * writeback of a checkpoint in this situation can
> > > > * prevent future recovery of all the changes in the
> > > > * checkpoints at this start LSN.
> > > > *
> > > > * Note: Shutting down the filesystem will result in the
> > > > * delwri submission marking all the buffers stale,
> > > > * completing them and cleaning up _XBF_LOGRECOVERY
> > > > * state without doing any IO.
> > > > */
> > > > xlog_force_shutdown(log, SHUTDOWN_LOG_IO_ERROR);
> > > > }
> > > > error2 = xfs_buf_delwri_submit(&buffer_list);
> > > > }
> > > >
> > >
> > > This solution is also used in our internally maintained linux branch,
> > > and after several months of testing, the problem no longer arises. It
> > > seems safe and reasonable enough.
> > >
> >
> > I assume you're referring to the xfs_buf_delwri_cancel() change..? If
>
> I'm not referring to the xfs_buf_delwri_cancel() solution, but to the
> use of the xlog_force_shutdown() solution. We believe that the shutdown
> solution is less risky because buffer list can be cleanup automatically
> via IO submission/completion, it would not change any other logic, so
> we've used it in our internally maintained linux branch.
>
> On the other hand, I thought it would be better to positively cancel
> the buffer list, so I sent it out, but I overlooked potential issues...
>
Ah, Ok. Sorry for the confusion. Thanks.
Brian
> > so, this is a valid data point but doesn't necessarily help explain
> > whether the change is correct in any other context. I/O cancel probably
> > doesn't happen often for one, and even if it does, it's not totally
> > clear if you're reproducing a situation where the item might be AIL
> > resident or not at the time (or it is and you have a use after free that
> > goes undetected). And even if none of that is relevant, that still
> > doesn't protect against future code changes if this code doesn't respect
> > the established bli lifecycle rules.
>
> Yes, agree with you.
>
> >
> > IIRC, the bli_refcount (sampled via xfs_buf_item_relse()) is not
> > necessarily an object lifecycle refcount. It simply reflects whether the
> > item exists in a transaction where it might eventually be dirtied. This
> > is somewhat tricky, but can also be surmised from some of the logic in
> > xfs_buf_item_put(), for example.
> >
> > IOW, I think in the simple (non-recovery) case of a buffer being read,
> > modified and committed by a transaction, the bli would eventually end up
> > in a state where bli_refcount == 0 but is still resident in the AIL
> > until eventually written back by xfsaild. That metadata writeback
> > completion is what eventually frees the bli via xfs_buf_item_done().
> >
> > So if I'm not mistaken wrt to the above example sequence, the
> > interesting question is if we suppose a buffer is in that intermediate
> > state of waiting for writeback, and then somebody were to hypothetically
> > execute a bit of code that simply added the associated buffer to a
> > delwri q and immediately cancelled it, what would happen with this
> > change in place? ISTM the remove would prematurely free the buffer/bli
> > while it's still resident in the AIL and pending writeback, thus
> > resulting in use-after-free or potential memory/list corruption, etc. Is
> > that not the case?
>
> Yes, if buf item still in the AIL, it is obviously not right to release
> the buf item.
>
> Thanks,
> Long Li
>
prev parent reply other threads:[~2024-01-15 14:13 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-12-28 12:46 [PATCH] xfs: ensure submit buffers on LSN boundaries in error handlers Long Li
2024-01-04 2:01 ` Darrick J. Wong
2024-01-04 13:03 ` Long Li
2024-01-07 22:13 ` Dave Chinner
2024-01-08 12:28 ` Long Li
2024-01-08 23:40 ` Dave Chinner
2024-01-09 14:52 ` Brian Foster
2024-01-09 21:43 ` Dave Chinner
2024-01-10 7:03 ` Long Li
2024-01-10 23:08 ` Dave Chinner
2024-01-11 14:54 ` Brian Foster
2024-01-10 8:38 ` Long Li
2024-01-11 0:47 ` Dave Chinner
2024-01-12 12:55 ` Long Li
2024-01-12 18:42 ` Brian Foster
2024-01-15 13:31 ` Long Li
2024-01-15 14:14 ` Brian Foster [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZaU9xmrLRREwQ9Aa@bfoster \
--to=bfoster@redhat.com \
--cc=chandanbabu@kernel.org \
--cc=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=houtao1@huawei.com \
--cc=leo.lilong@huawei.com \
--cc=linux-xfs@vger.kernel.org \
--cc=yangerkun@huawei.com \
--cc=yi.zhang@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox