From: Long Li <leo.lilong@huawei.com>
To: Brian Foster <bfoster@redhat.com>
Cc: <djwong@kernel.org>, <chandanbabu@kernel.org>,
<linux-xfs@vger.kernel.org>, <yi.zhang@huawei.com>,
<houtao1@huawei.com>, <yangerkun@huawei.com>,
Dave Chinner <david@fromorbit.com>
Subject: Re: [PATCH] xfs: ensure submit buffers on LSN boundaries in error handlers
Date: Wed, 10 Jan 2024 16:38:08 +0800
Message-ID: <20240110083808.GA2075885@ceph-admin>
In-Reply-To: <ZZ1dtV1psURJnTOy@bfoster>
On Tue, Jan 09, 2024 at 09:52:37AM -0500, Brian Foster wrote:
> > > > > After commit 12818d24db8a ("xfs: rework log recovery to submit buffers on
> > > > > LSN boundaries"), we submit buffers on LSN boundaries during log
> > > > > recovery.
> > > >
> > > > Correct - we submit all the changes in a checkpoint for submission
> > > > before we start recovering the next checkpoint. That's because
> > > > checkpoints are supposed to be atomic units of change moving the
> > > > on-disk state from one change set to the next.
> > >
> > > Submitting buffers on LSN boundaries does not mean submitting buffers on
> > > checkpoint boundaries during recovery. In my understanding, one
> > > transaction on disk corresponds to a checkpoint, but multiple
> > > transactions on disk may share the same LSN, so in that case we must
> > > ensure that all of those transactions are submitted together. This rule
> > > was introduced by commit 12818d24db8a ("xfs: rework log recovery to
> > > submit buffers on LSN boundaries").
> >
> > Well, yes, that's exactly the situation that commit 12818d24db8a was
> > intended to handle:
> >
> > "If independent transactions share an LSN and both modify the
> > same buffer, log recovery can incorrectly skip updates and leave
> > the filesystem in an inconsistent state."
> >
> > Unfortunately, we didn't take into account the complexity of
> > multiple transactions sharing the same start LSN in commit
> > 12818d24db8a ("xfs: rework log recovery to submit buffers on LSN
> > boundaries") back in 2016.
> >
> > Indeed, we didn't even know that there was a reliance on strict
> > start record LSN ordering in journal recovery until 2021:
> >
> > commit 68a74dcae6737c27b524b680e070fe41f0cad43a
> > Author: Dave Chinner <dchinner@redhat.com>
> > Date: Tue Aug 10 18:00:44 2021 -0700
> >
> > xfs: order CIL checkpoint start records
> >
> > Because log recovery depends on strictly ordered start records as
> > well as strictly ordered commit records.
> >
> > This is a zero day bug in the way XFS writes pipelined transactions
> > to the journal which is exposed by fixing the zero day bug that
> > prevents the CIL from pipelining checkpoints. This re-introduces
> > explicit concurrent commits back into the on-disk journal and hence
> > out of order start records.
> >
> > The XFS journal commit code has never ordered start records and we
> > have relied on strict commit record ordering for correct recovery
> > ordering of concurrently written transactions. Unfortunately, root
> > cause analysis uncovered the fact that log recovery uses the LSN of
> > the start record for transaction commit processing. Hence, whilst
> > the commits are processed in strict order by recovery, the LSNs
> > associated with the commits can be out of order and so recovery may
> > stamp incorrect LSNs into objects and/or misorder intents in the AIL
> > for later processing. This can result in log recovery failures
> > and/or on disk corruption, sometimes silent.
> >
> > Because this is a long standing log recovery issue, we can't just
> > fix log recovery and call it good. This still leaves older kernels
> > susceptible to recovery failures and corruption when replaying a log
> > from a kernel that pipelines checkpoints. There is also the issue
> > that in-memory ordering for AIL pushing and data integrity
> > operations are based on checkpoint start LSNs, and if the start LSN
> > is incorrect in the journal, it is also incorrect in memory.
> >
> > Hence there's really only one choice for fixing this zero-day bug:
> > we need to strictly order checkpoint start records in ascending
> > sequence order in the log, the same way we already strictly order
> > commit records.
> >
> > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > Reviewed-by: Darrick J. Wong <djwong@kernel.org>
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> >
> > Essentially, the problem now is that even with strictly ordered
> > start records for checkpoints, checkpoints with the same start LSN
> > interfere with each other in recovery because recovery is not
> > aware of the fact that we can have multiple checkpoints that start
> > with the same LSN.
> >
> > This is another zero-day issue with the journal and log recovery;
> > originally there was no "anti-recovery" logic in the journal like we
> > have now with LSNs to prevent recovery from taking metadata state
> > backwards. Hence log recovery just always replayed every change
> > that was in the journal from start to finish and so there was never
> > a problem with having multiple start records in the same log record.
> >
> > However, this was known to cause problems with inodes and data vs
> > metadata sequencing and non-transactional inode metadata updates
> > (e.g. inode size), so a "flush iteration" counter was added to
> > inodes in 2003:
> >
> > commit 6ed3d868e47470a301b49f1e8626972791206f50
> > Author: Steve Lord <lord@sgi.com>
> > Date: Wed Aug 6 21:17:05 2003 +0000
> >
> > Add versioning to the on disk inode which we increment on each
> > flush call. This is used during recovery to avoid replaying an
> > older copy of the inode from the log. We can do this without
> > versioning the filesystem as the pad space we borrowed was
> > always zero and will be ignored by old kernels.
> > During recovery, do not replay an inode log record which is older
> > than the on disk copy. Check for wrapping in the counter.
> >
> > This was never fully reliable, and there were always issues with
> > this counter because inode changes weren't always journalled nor
> > were cache flushes used to ensure unlogged inode metadata updates
> > reached stable storage.
> >
> > The LSN sequencing was added to the v5 format to ensure metadata
> > never goes backwards in time on disk without fail. The issue you've
> > uncovered shows that we still have issues stemming from the
> > original journal recovery algorithm that was not designed with
> > anti-recovery protections in mind from the start.
> >
> > The problem we need to solve is how to preserve the necessary
> > anti-recovery behaviour when multiple checkpoints can have the same
> > LSN and objects are updated immediately on recovery.
> >
> > I suspect that we need to track that the checkpoint being recovered
> > has a duplicate start LSN (i.e. in the struct xlog_recover) and
> > modify the anti-recovery LSN check to take this into account. i.e.
> > we can really only skip recovery of the first checkpoint at any
> > given LSN because we cannot disambiguate an LSN updated by the first
> > checkpoint at that LSN and the metadata already being up to date on
> > disk in the second and subsequent checkpoints at the same start
> > LSN.
> >
> > There are likely to be other solutions - anyone have a different
> > idea on how we might address this?
> >
>
> It's been a while since I've looked at any of this and I haven't waded
> through all of the details, so I could easily be missing something, but
> what exactly is wrong with the approach of the patch as posted?
>
> Commit 12818d24db ("xfs: rework log recovery to submit buffers on LSN
> boundaries") basically created a new invariant for log recovery where
> buffers are allowed to be written only once per LSN. The risk otherwise
> is that a subsequent update with a matching LSN would not be correctly
> applied due to the v5 LSN ordering rules. Since log recovery processes
> transactions (using terminology/granularity as defined by the
> implementation of xlog_recover_commit_trans()), this required changes to
> accommodate any of the various possible runtime logging scenarios that
> could cause a buffer to have multiple entries in the log associated with
> a single LSN, the details of which were orthogonal to the fix.
>
> The functional change therefore was that rather than processing and
> submitting "transactions" in sequence during recovery, the pending buffer
> list was lifted to a higher level in the code, a tracking field was
> added for the "current LSN" of log recovery, and only once we cross a
> current LSN boundary are we allowed to submit the set of buffers
> processed for the prior LSN. The reason for this logic is that seeing
> the next LSN is really the only way to know we're done processing items
> for a particular LSN.
>
> If I understand the problem description correctly, the issue here is
> that if an error is encountered in the middle of processing items for
> some LSN A, we bail out of recovery and submit the pending buffers on
> the way out. If we haven't completed processing all items for LSN A
> before failing, however, then we've just possibly violated the "write
> once per LSN" invariant that protects from corrupting the fs. This is
> because the writeback permanently updates metadata LSNs (assuming that
> I/O doesn't fail), which means if recovery retries from the same point
> the next time around and progresses to find a second instance of an
> already written buffer in LSN A, it will exhibit the same general
> behavior from before the write once invariant was introduced. IOW,
> there's still a vector to the original problematic multi-write per LSN
> behavior through multiple recovery attempts (hence the simulated I/O
> error to reproduce).
>
> Long Li, am I following the problem description correctly? I've not
> fully reviewed it, but if so, the proposed solution seems fairly sane
> and logical to me. (And nice work tracking this down, BTW, regardless of
> whether this is the final solution. ;).
>
Hi Brian, your description is correct, and it is clear and easy to
understand. Thanks for your encouragement of my work.
Thanks,
Long Li
Thread overview: 17+ messages
2023-12-28 12:46 [PATCH] xfs: ensure submit buffers on LSN boundaries in error handlers Long Li
2024-01-04 2:01 ` Darrick J. Wong
2024-01-04 13:03 ` Long Li
2024-01-07 22:13 ` Dave Chinner
2024-01-08 12:28 ` Long Li
2024-01-08 23:40 ` Dave Chinner
2024-01-09 14:52 ` Brian Foster
2024-01-09 21:43 ` Dave Chinner
2024-01-10 7:03 ` Long Li
2024-01-10 23:08 ` Dave Chinner
2024-01-11 14:54 ` Brian Foster
2024-01-10 8:38 ` Long Li [this message]
2024-01-11 0:47 ` Dave Chinner
2024-01-12 12:55 ` Long Li
2024-01-12 18:42 ` Brian Foster
2024-01-15 13:31 ` Long Li
2024-01-15 14:14 ` Brian Foster